IPython notebook: CNN on Foosball sounds.ipynb
Trained CNN model using TensorFlow: model.ckpt
Pickled Pandas dataframe: full_dataset_44100.pickle
I set up mics at the foosball table and recorded a few hours of foosball games. The audio files were labelled by hand and then segmented into one-second clips of goals and other noises. Mel spectrograms were created from the clips, giving about 200 samples that were used for training, testing, and validation. The resulting model had 5% error on the test data.
Data collection and labelling
I used a Zoom H5 XY stereo mic, a Shure SM57, and a few other mics for recording. Each mic had its own characteristics, and they were placed at different locations around the table: for instance, pointing closely at a goalie, high above the table pointing downward, or at one side of the table pointing at the goalie on the far side. There might be enough difference between the tracks to improve the model. The audio was recorded in 16-bit WAV format at 44.1kHz.
The audio files were labelled manually: I played each file in Audacity at 2x speed and added a label on the timeline whenever I heard a goal. Audacity has a function to export the labels to a text file. There was typically a lag of half a second to two seconds between my labels and the actual foosball goal, so I made a second pass through the labels to fine-tune them. Adjusting the labels allowed me to use a shorter audio clip as the input to the model later.
For the non-goal samples, I simply took the middle of two goals.
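As a rough sketch of this step, the snippet below parses a label file in the tab-separated layout that Audacity exports (start time, end time, label text on each line) and takes the midpoint between each pair of consecutive goals as a non-goal timestamp. The function names and the example label text are made up for illustration; this is not the code that was used to build the dataset.

```python
def parse_audacity_labels(text):
    """Return the label start times (in seconds) from an Audacity label export."""
    times = []
    for line in text.splitlines():
        fields = line.split("\t")
        # Skip blank lines and the extra "\"-prefixed lines Audacity may write
        if len(fields) < 2 or line.startswith("\\"):
            continue
        times.append(float(fields[0]))
    return sorted(times)


def midpoints(goal_times):
    """Timestamps halfway between each pair of consecutive goals."""
    return [(a + b) / 2 for a, b in zip(goal_times, goal_times[1:])]


# Hypothetical label file contents: start<TAB>end<TAB>label
example = "12.5\t12.5\tgoal\n40.0\t40.0\tgoal\n95.5\t95.5\tgoal\n"
goal_times = parse_audacity_labels(example)
non_goal_times = midpoints(goal_times)
print(goal_times)
print(non_goal_times)
```

Taking the midpoint keeps the non-goal clips as far as possible from both neighbouring goals, which reduces the chance that a goal sound leaks into a negative sample.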
When I recorded the audio, I adjusted the gains so that each mic was at more or less the same level. I did not do any post-processing, not even noise reduction.
Notes on dealing with audio data in Python
As mentioned earlier, the audio was recorded in 16-bit WAV format at a sample rate of 44.1kHz. 44.1kHz means the sound is sampled 44100 times per second. Each sample represents the amplitude of the sound wave at that instant. 16-bit is the bit depth of the samples, i.e. the number of bits used to store each one.
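To make these numbers concrete, here is a small sketch of the arithmetic for a one-second mono clip, plus the usual way of mapping 16-bit integer samples to floating-point values. The variable names are illustrative only.

```python
import numpy as np

sample_rate = 44100  # samples per second
bit_depth = 16       # bits per sample
seconds = 1

n_samples = sample_rate * seconds
n_bytes = n_samples * bit_depth // 8  # per channel, uncompressed
print(n_samples, n_bytes)

# 16-bit samples are signed integers in [-32768, 32767];
# dividing by 32768 maps them to floats in [-1.0, 1.0)
pcm = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
as_float = pcm.astype(np.float32) / 32768.0
print(as_float)
```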
Using librosa to load audio data in Python:
import librosa
# Note: librosa resamples to 22050 Hz by default; pass sr=None to keep the file's native rate
y, sr = librosa.core.load("path_to_file")
y is a NumPy array of the audio data. sr is the sample rate.
Since sample rate is the number of samples per second, this returns a segment between 00:01 and 00:02:
segment = y[1*sr:2*sr]
Then we can create an audio control to play the clip in an IPython notebook:
import IPython.display
# rate is required when the data is a NumPy array
IPython.display.display(IPython.display.Audio(segment, rate=sr))
I wrote some code to read the timestamps and segment the original audio files. I have also pushed the final resulting dataset to GitHub.
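The segmentation step can be sketched roughly as follows. This is a minimal illustration rather than the actual code in the repository: the cut_clip helper, the choice to start each clip at the timestamp, and the example timestamps are all assumptions, and a silent array stands in for a loaded recording.

```python
import numpy as np


def cut_clip(y, sr, start_s, duration_s=1.0):
    """Return duration_s seconds of audio starting at start_s seconds."""
    start = int(round(start_s * sr))
    end = start + int(round(duration_s * sr))
    if start < 0 or end > len(y):
        raise ValueError("clip falls outside the recording")
    return y[start:end]


# Stand-in for a loaded recording: 5 seconds of silence at 44.1 kHz
sr = 44100
y = np.zeros(5 * sr, dtype=np.float32)

# Hypothetical goal timestamps in seconds
clips = [cut_clip(y, sr, t) for t in (1.0, 2.5, 3.75)]
print([len(c) for c in clips])
```

Raising an error for out-of-range timestamps, instead of returning a shorter clip, keeps every sample the same length, which a fixed-size network input requires.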
I used librosa to create some additional features such as mel spectrograms. They seem to work better than the original waveform for training a neural network.
Loading the ready-to-use dataset
import pandas
ds = pandas.read_pickle("full_dataset_44100.pickle")
Play a clip:
Building the convolutional neural network
I used the TensorFlow MNIST example as my template, but instead of using mini-batches I used the entire training set for each iteration, because there are only 160 samples in the training set. The first neural net I built used the 11.025kHz waveform as the input but had 15% error on the test data. I then trained another one using mel spectrograms as the input, which yielded better results: after training for 200 iterations, the CNN had 5% error on the test data. It took 15 minutes to train the model on my laptop with an i5-6200U CPU @ 2.30GHz and 8GB of RAM, in a Docker container in a VirtualBox Ubuntu VM on a Windows 10 host!
The network has two convolution layers, each followed by ReLU and max pooling, then two fully connected layers, with softmax as the output.
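A network of this shape could be written with the Keras API roughly as below. This is a sketch and not the trained model: the original followed the lower-level TensorFlow MNIST example, and the input size, filter counts, kernel sizes, and dense layer width here are all assumed values chosen for illustration.

```python
import tensorflow as tf

# Assumed input: a 128 x 87 mel spectrogram with a single channel.
# All layer sizes below are illustrative guesses, not the trained model's.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 87, 1)),
    tf.keras.layers.Conv2D(16, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),  # goal vs. other noise
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
print(model.output_shape)
```

With hypothetical arrays x_train and y_train, training on the whole set at every step would correspond to model.fit(x_train, y_train, batch_size=len(x_train), epochs=200).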
The IPython notebook is available at https://github.com/dk1027/ConvolutionalNeuralNetOnFoosballSounds along with the pickled dataset and the original recording encoded in mp3 format.
There are more foosball recordings that I have not labelled, which could be used as additional data. I would appreciate it if anyone would like to label those games.
The full_dataset_44100.pickle only uses the TrLR track (i.e. the XY stereo mics), and it was prepared from the original WAV files instead of the mp3s. I am interested to know whether training with the Tr1 and Tr2 tracks as well would provide any improvement. They recorded the same foosball games, but the mics were placed at different positions and pointed in different directions.