Retraining Inception-v3 neural network for a new task with TensorFlow

This post is a work log of taking a pre-trained Inception-v3 network and repurposing it to colorize a grayscale image. The idea is based on this paper.


  • Prepare the dataset: Convert training images from JPEG to HSV values. The input is V and target is HS.
  • Train a scene classification network using the Places365 data. Use a pre-trained Inception-v3 image classification model.
  • Fuse the classification network's features with the mid-level feature layers to form a “fusion layer”, then build a colorization network on top of it.
  • Compute the MSE between the predicted HS values and the actual HS values.
  • Use a generative adversarial network to generate realistic-looking colorized images.
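
The dataset split and the MSE step above can be sketched in plain Python. This is only an illustration, not the code used in this project: it assumes RGB pixels given as floats in [0, 1], uses the standard-library colorsys module in place of the TensorFlow ops, and the helper names split_hsv and mse are hypothetical.

```python
import colorsys

def split_hsv(rgb_pixels):
    """Split RGB pixels (floats in [0, 1]) into V inputs and (H, S) targets."""
    inputs, targets = [], []
    for r, g, b in rgb_pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        inputs.append(v)          # network input: brightness only
        targets.append((h, s))    # network target: hue and saturation
    return inputs, targets

def mse(predicted, actual):
    """Mean squared error over all H and S values."""
    squared = [(p - a) ** 2
               for p_pair, a_pair in zip(predicted, actual)
               for p, a in zip(p_pair, a_pair)]
    return sum(squared) / len(squared)

pixels = [(1.0, 0.0, 0.0), (0.5, 0.5, 0.5)]   # pure red, mid grey
x, y = split_hsv(pixels)
```

One caveat worth keeping in mind: hue is an angle, so a plain MSE treats hues of 0.99 and 0.01 as far apart even though they are nearly the same color.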

[log][20170131] I did not end up training the hue and saturation. I was, however, able to generate colorized images although the images are colored green-blue-brownish and desaturated. Apparently this is a common problem. I am now training a deep convolutional generative adversarial network (DCGAN) so that the colorization part uses a discriminator network instead of a mean squared error as the cost function.

[log][20170126] I left the network to train for almost 2 days. The training errors are going up and down, and it colored the image all green. I don’t want to make the network more complex yet, so I just added batch normalization layers, tweaked the activation functions, and retrained. I left it to train overnight and this morning its training error is already lower than before. This morning I have a new idea. Perhaps I should have one branch to train the hue and another branch to train the saturation. They can share the fusion layer but have different colorization layers.

[log][20170124] I have built the colorization model according to the paper. However, our models are not identical: for the classification part I am using Inception-v3, so the shapes of the low-level and mid-level features differ from those used in the paper. Initially my training error was not going down because of an implementation error and a learning rate that I had set too high. After I addressed those issues, the training error started going down steadily.

[log][20170110] inception/slim/ defines the inception model.

[log][20170108] Inception-v3 takes an arbitrarily sized image, crops it to 87.5%, resizes it back to the original size, then resizes it down to 299×299.
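
The crop arithmetic can be sketched as follows. This is a rough illustration, assuming the crop is centered and that 87.5% applies to each side of the image; central_crop_box is a hypothetical helper, not a function from the Inception code.

```python
def central_crop_box(height, width, fraction=0.875):
    """Return (top, left, crop_height, crop_width) of a centered crop."""
    top = int((height - height * fraction) / 2)
    left = int((width - width * fraction) / 2)
    return top, left, height - 2 * top, width - 2 * left

TARGET_SIZE = (299, 299)  # final input size mentioned in the log entry

# Example: a 400x600 image keeps a 350x526 region starting at (25, 37)
box = central_crop_box(400, 600)
```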

[log][20170108] Realized that the downloaded Inception model is hardcoded to a batch size of 1. That won’t do for training, and it is not feasible to change it. I found a newer version of Inception, saved in March 2016, that uses a checkpoint instead of a pb file. That should let me change the batch size arbitrarily.

[todo][20170107] figure out how to connect the pre-trained model to another network

[log][20170107] I played with Inception. Since my input will be grayscale, my greatest concern was whether Inception plays well with grayscale images. Yes, it still classifies a cat as a cat. Good enough.

[todo][20170103] TensorFlow inception is already an image classifier. Can I take this one and use it? I was working on my own data loader and model because Places365’s pre-built models were not built with TensorFlow. However, inception is trained on color images, not greyscale. That said, can I take the code and make my own modifications to adapt it to greyscale images, then train from scratch? Probably much faster than building everything from scratch on my own. Seems like the answer is yes.

[todo][20170103] Improve jpeg->training data performance. Utilize more CPU? Seems like this is not necessary. TensorFlow can process images in a background thread while training. See: Reading Data.

[log][20170103] Wondered about reading from JPEG and converting to inputs and targets each time versus converting all the JPEGs once, writing the results to disk, then loading a batch at a time during training. Did some quick calculations: input = 256*256*1*4 bytes, target = 128*128*2*4 bytes per image. That is roughly 630GB for the 1.6 million image dataset. This is just not feasible with my hardware without spending more money! Instead, I need to spend more time figuring out how to convert the JPEGs to training data faster. Currently on my laptop, it takes 140 seconds to convert 5000 images. I noticed that only 30% of the CPU was used during the conversion.
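
The quick calculation above, written out. The 4 bytes per value is taken from the log entry (it corresponds to 32-bit floats); the conversion time is simply the measured laptop rate extrapolated to one pass over the whole dataset.

```python
BYTES_PER_VALUE = 4          # 32-bit float per channel value
NUM_IMAGES = 1_600_000

input_bytes = 256 * 256 * 1 * BYTES_PER_VALUE    # V channel, full resolution
target_bytes = 128 * 128 * 2 * BYTES_PER_VALUE   # H and S channels, half resolution
total_bytes = (input_bytes + target_bytes) * NUM_IMAGES
total_gb = total_bytes / 1e9                     # about 629 GB

# One pass over the dataset at 5000 images per 140 seconds
conversion_hours = NUM_IMAGES / 5000 * 140 / 3600   # about 12.4 hours
```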

[todo][20170103] Figure out what’s the best way to persist the preprocessed X and Y – not going to do this due to disk space limitation. See log above.

[log][20170103] Compared original vs upsampled vs only upsampled HS channels combined with full res V channel. As expected, visually, upsampling only the HS channels shows tolerable visual degradation. DataProcessor now downsamples HS channels to config.output_size.
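
A toy version of the downsample-then-upsample comparison, for one channel. How DataProcessor actually resizes is not shown here, so the 2x2 block averaging and nearest-neighbour upsampling below are assumptions chosen for readability. The example uses saturation because naively averaging hue, which is an angle, is questionable.

```python
def downsample_2x(channel):
    """Average each 2x2 block of a 2-D list (even height and width assumed)."""
    return [
        [(channel[r][c] + channel[r][c + 1]
          + channel[r + 1][c] + channel[r + 1][c + 1]) / 4
         for c in range(0, len(channel[0]), 2)]
        for r in range(0, len(channel), 2)
    ]

def upsample_2x(channel):
    """Nearest-neighbour upsampling: repeat every value in a 2x2 block."""
    rows = []
    for row in channel:
        wide = [value for value in row for _ in range(2)]
        rows.append(wide)
        rows.append(list(wide))
    return rows

saturation = [[0.0, 0.2, 1.0, 1.0],
              [0.2, 0.0, 1.0, 1.0]]
small = downsample_2x(saturation)     # half resolution, as stored in the target
restored = upsample_2x(small)         # back to the resolution of the V channel
```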

[todo][20170103] Test run preprocessing a large batch of training jpg images.

[todo][20170102] Verify that my X and Y outputs are correct. Put them back together, convert back to RGB, and see that image looks right?
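
One way to do this check, sketched with the standard-library colorsys module rather than the TensorFlow ops: split a few pixels into V and HS, put them back together, convert to RGB, and measure the largest difference from the original. The recombine helper is hypothetical.

```python
import colorsys

def recombine(v_values, hs_values):
    """Rebuild RGB pixels from V inputs and (H, S) targets."""
    return [colorsys.hsv_to_rgb(h, s, v)
            for v, (h, s) in zip(v_values, hs_values)]

original = [(0.8, 0.4, 0.2), (0.1, 0.6, 0.9), (0.5, 0.5, 0.5)]
hsv = [colorsys.rgb_to_hsv(r, g, b) for r, g, b in original]
v_values = [v for _, _, v in hsv]
hs_values = [(h, s) for h, s, _ in hsv]

rebuilt = recombine(v_values, hs_values)
# Should be close to zero if the split and recombination are consistent
max_error = max(abs(a - b)
                for p, q in zip(original, rebuilt)
                for a, b in zip(p, q))
```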

[todo][20170102] Downsample Y. In the paper they predict the color values at a lower resolution then upsample it.

[todo][20170102] Remove hardcoded limit in Currently hardcoded to only list 10 files.

[log][20170102] Decompressing the dataset took a long time… Found an index of categories on places365 github. Worked on DataPreprocessor: loading jpg to rgb then convert to HSV and split into inputs (V) and targets (HS).

[todo][20161231] Pickle the dataset – not going to do this. Not feasible due to disk limitation. See the log on 20170103.

[log][20161231] I have decided to use the “small” dataset from Places365. It is still going to be 24GB for the training data. In the paper, they converted images to LAB. However, I need an efficient way to create my dataset, and TensorFlow already provides methods to convert between the RGB and HSV color spaces and to decode JPEG. I want to focus on building the neural net, so this seems a reasonable compromise.


Notes on Recurrent Neural Networks

Recurrent neural nets have states, unlike feed-forward networks. An analogy for RNN is the C strtok function, where calling it with the same parameter typically yields a different value (but of course, unlike strtok, RNN does not modify the input). An analogy for feed-forward networks is a function in the mathematical sense, where y=f(x) regardless of how many times it was called.

At first I thought what makes RNN special is that it uses its own output as part of its input. While that’s true, after more reading, it seems that the magic really is the cell state. The cell state in an RNN is updated each time it processes the input. Using the strtok analogy, it is like how strtok updates its internal position of the last token each time strtok is called, so the next time you call it, it returns the next token.

So RNN is like a program whereas a feed-forward network is like a function.
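
The analogy can be made concrete with a toy example. This is a minimal sketch, assuming a single scalar hidden state and made-up weights; a real recurrent cell (for example an LSTM) has vector states and learned parameters.

```python
import math

def feed_forward(x, weight=0.5, bias=0.0):
    """A pure function: the same x always gives the same output."""
    return math.tanh(weight * x + bias)

class TinyRNNCell:
    """A one-unit recurrent cell that carries a state between calls."""

    def __init__(self, w_input=0.5, w_state=0.8, bias=0.0):
        self.w_input = w_input
        self.w_state = w_state
        self.bias = bias
        self.state = 0.0  # updated on every call, like strtok's saved position

    def step(self, x):
        self.state = math.tanh(
            self.w_input * x + self.w_state * self.state + self.bias)
        return self.state

cell = TinyRNNCell()
first = cell.step(1.0)
second = cell.step(1.0)   # same input, different output because the state changed
```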

An audio dataset and IPython notebook for training a convolutional neural network to distinguish the sound of foosball goals from other noises using TensorFlow


IPython notebook: CNN on Foosball sounds.ipynb
Trained CNN model using TensorFlow: model.ckpt
Pickled Pandas dataframe: full_dataset_44100.pickle


I set up mics at the foosball table and recorded a few hours of foosball games. The audio files were labelled by hand and then segmented into one-second clips of goals / other noises. Mel spectrograms were created from the clips, giving about 200 samples that were split into training, testing, and validation sets. The resulting model had 5% error on the test data.

Data collection and labelling

I used a Zoom H5 XY stereo mic, a Shure SM57, and a few other mics for recording. Each mic had its own characteristic and they were placed at different locations around the table, for instance, pointing close to a goalie, high above the table pointing downward, or from one side of the table pointing at a goalie at the far side. There might be enough differences between each track to improve the model. The audio was recorded in 16-bit wav format at 44.1kHz.

The audio files were labelled manually by playing each file in Audacity at 2x speed and adding a label on the timeline whenever a goal was heard. Audacity has a function to export the labels to a text file. There was typically half a second to two seconds of lag between my labels and the actual foosball goal, so I did a second pass through each label to fine-tune them. Adjusting the labels allowed me to use a shorter audio clip as the input to the model later.

For the non-goal samples, I simply took the middle of two goals.
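
A sketch of how the exported labels could be turned into clip times. It assumes the usual Audacity label export layout of one tab-separated "start, end, label text" line per label, and it reads "the middle of two goals" as the midpoint in time between consecutive goals. Both function names and the sample label text are made up for illustration.

```python
def parse_audacity_labels(text):
    """Return the sorted start times (seconds) from an exported label file."""
    times = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        times.append(float(fields[0]))   # first column is the label start
    return sorted(times)

def non_goal_times(goal_times):
    """Midpoints between consecutive goals, used as 'other noise' samples."""
    return [(a + b) / 2 for a, b in zip(goal_times, goal_times[1:])]

labels = ("12.500000\t12.500000\tgoal\n"
          "47.000000\t47.000000\tgoal\n"
          "90.250000\t90.250000\tgoal\n")
goals = parse_audacity_labels(labels)
others = non_goal_times(goals)
```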

When I recorded the audio, I adjusted the gains so that each mic was at more or less the same level. I did not do any post-processing, not even noise reduction.

Notes on dealing with audio data in Python

As mentioned earlier, the audio was recorded in 16-bit wav format at a sample rate of 44.1kHz. 44.1kHz means the sound is sampled 44100 times per second. Each sample represents the amplitude of the sound wave at that instant. 16-bit is the bit depth of the samples, i.e. the number of bits used to store each one.
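
To put numbers on this, here is the arithmetic for a one-second clip from a single channel of uncompressed 16-bit audio. The variable names are just for illustration.

```python
SAMPLE_RATE = 44100      # samples per second
BIT_DEPTH = 16           # bits per sample

samples_per_clip = SAMPLE_RATE * 1                   # a one-second clip
bytes_per_clip = samples_per_clip * BIT_DEPTH // 8   # one channel, uncompressed
min_amplitude = -(2 ** (BIT_DEPTH - 1))              # lowest signed 16-bit value
max_amplitude = 2 ** (BIT_DEPTH - 1) - 1             # highest signed 16-bit value
```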

Using librosa to load audio data in Python:

import librosa
# By default librosa resamples to 22050 Hz; pass sr=None to keep the file's own rate
y, sr = librosa.load("path_to_file")

y is a numpy array of the audio data. sr is the sample rate.

Since sample rate is the number of samples per second, this returns a segment between 00:01 and 00:02:

segment = y[1*sr:2*sr]
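
The same slicing generalizes to any start time and duration. A small sketch, using a hypothetical clip helper and a tiny made-up sample rate so the result is easy to read; it works the same way on a NumPy array.

```python
def clip(samples, sample_rate, start_seconds, duration_seconds=1.0):
    """Return the samples covering [start, start + duration) seconds."""
    start = int(round(start_seconds * sample_rate))
    stop = start + int(round(duration_seconds * sample_rate))
    return samples[start:stop]

sr = 8                        # made-up rate: 8 samples per second
y = list(range(3 * sr))       # three "seconds" of fake audio
segment = clip(y, sr, start_seconds=1.0)   # same as y[1*sr:2*sr]
```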

Then we can create an audio control to play the clip in IPython notebook:

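import IPython.display
from IPython.display import Audio
# As the last expression in a notebook cell, this renders an inline player
Audio(data=segment, rate=sr)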

I wrote some code to read the timestamps and segment the original audio files. I have also pushed the final resulting dataset to GitHub.

I used librosa to create some additional features such as mel spectrograms. They seem to work better than the original waveform for training a neural network.
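
For intuition about what "mel" means, here is a sketch of the mel scale itself, not of the full spectrogram (librosa computes that with librosa.feature.melspectrogram). It uses the HTK-style formula, which is one common convention and may differ from a library's default; the function names and the choice of 4 bands are for illustration only.

```python
import math

def hz_to_mel(frequency_hz):
    """Convert Hz to mels with the HTK-style formula."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def mel_to_hz(mels):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)

def mel_band_edges(num_bands, f_min=0.0, f_max=22050.0):
    """Band edges in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (num_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(num_bands + 2)]

# 22050 Hz is the highest frequency representable at a 44.1kHz sample rate
edges = mel_band_edges(4)
```

The bands get wider as frequency goes up, which gives low frequencies more resolution than high ones.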


Loading the ready-to-use dataset

Load full_dataset_44100.pickle:

import pandas
ds = pandas.read_pickle("full_dataset_44100.pickle")

Play a clip:

IPython.display.display(IPython.display.Audio(data=ds.iloc[0]["data"], rate=44100))

Building the convolutional neural network

I used the TensorFlow MNIST example as my template, but instead of doing mini batches, I used the entire training set for each iteration because there are only 160 samples in the training set. The first neural net I built used the 11.025kHz waveform as the input but had 15% error on the test data. Then I trained another one using mel spectrograms as the input, which yielded better results. After training for 200 iterations, the CNN had 5% error on the test data. It took 15 minutes to train the model on my laptop with an i5-6200U CPU @2.30GHz and 8GB RAM, in a Docker container in a VirtualBox Ubuntu VM on a Windows 10 host!

The network has two convolution layers, each followed by ReLU and max pooling, then two fully connected layers, with softmax as the output.
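
To show how the feature map shrinks through such a network, here is a rough shape walk-through plus a softmax. All the numbers are assumptions for illustration and are not taken from the notebook: a 128x44 spectrogram input, convolutions that preserve size, 2x2 max pooling with stride 2, and 64 filters in the second convolution layer.

```python
import math

def same_pool_2x2(shape):
    """Output (height, width) of a 2x2 max pool, stride 2, 'SAME' padding."""
    height, width = shape
    return math.ceil(height / 2), math.ceil(width / 2)

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the max for stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

input_shape = (128, 44)    # hypothetical mel bands x time frames
after_block1 = same_pool_2x2(input_shape)    # conv keeps size, pool halves it
after_block2 = same_pool_2x2(after_block1)
flat_features = after_block2[0] * after_block2[1] * 64   # 64 filters assumed
probs = softmax([2.0, 0.0])   # two classes: goal vs. other noise
```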

The IPython notebook is available at along with the pickled dataset and the original recording encoded in mp3 format.


There are some more foosball recordings which I have not labelled. Those recordings could be used as additional data. I would appreciate it if anyone would like to label those foosball games.

The full_dataset_44100.pickle only uses the TrLR track (i.e. the XY stereo mics), and its clips were prepared from the original WAV files instead of the mp3s. I am interested to know whether training with the Tr1 and Tr2 tracks too would provide any improvement. They recorded the same foosball games, but the mics were placed at different positions and pointed in different directions.