Chord Intelligence mkIV: Training the Deep Neural Network
• Chris Liscio
The software that trains Capo’s chord detection engine can learn from hundreds of songs per minute, chewing through more than a month’s worth of audio in a little over an hour. This throughput is possible thanks to a combination of GPU hardware and the data pump that keeps it busy.
Using Apple’s MPS (Metal Performance Shaders) neural network APIs, I only need to describe a deep neural network that detects chords. Put another way, I just have to “sketch out” a high-level picture of the machinery that I need—the system handles all the math for me, and ensures that it runs efficiently on the GPU. With the network sketched out, I just need to worry about sending data to the API, and waiting for its calculations to complete.
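To give a feel for what that kind of “sketch” can look like, here’s a rough, heavily simplified example of the node-chaining style of the MPS graph API. To be clear, this is not Capo’s network: the layer shapes, the zero-filled `ToyWeights` placeholder, and the tiny conv→ReLU→softmax chain are all made up, just enough to satisfy the API and show the shape of the code.

```swift
import Metal
import MetalPerformanceShaders

// Placeholder weights provider: MPS asks this object for the layer's shape and
// its weight data. Here the weights are just zero-filled memory.
final class ToyWeights: NSObject, MPSCNNConvolutionDataSource {
    private let convDescriptor: MPSCNNConvolutionDescriptor
    private let weightData: UnsafeMutableRawPointer
    private let weightCount: Int

    init(kernelWidth: Int, kernelHeight: Int, inputChannels: Int, outputChannels: Int) {
        convDescriptor = MPSCNNConvolutionDescriptor(kernelWidth: kernelWidth,
                                                     kernelHeight: kernelHeight,
                                                     inputFeatureChannels: inputChannels,
                                                     outputFeatureChannels: outputChannels)
        weightCount = kernelWidth * kernelHeight * inputChannels * outputChannels
        weightData = UnsafeMutableRawPointer.allocate(
            byteCount: weightCount * MemoryLayout<Float>.stride,
            alignment: MemoryLayout<Float>.alignment)
        weightData.initializeMemory(as: Float.self, repeating: 0, count: weightCount)
    }

    deinit { weightData.deallocate() }

    func dataType() -> MPSDataType { .float32 }
    func descriptor() -> MPSCNNConvolutionDescriptor { convDescriptor }
    func weights() -> UnsafeMutableRawPointer { weightData }
    func biasTerms() -> UnsafeMutablePointer<Float>? { nil }
    func load() -> Bool { true }
    func purge() { }
    func label() -> String? { "toy convolution" }
    func copy(with zone: NSZone? = nil) -> Any { self }
}

// The "sketch": describe the network as a chain of node objects, then hand the
// whole description to MPSNNGraph, which turns it into GPU work.
let spectrumInput = MPSNNImageNode(handle: nil)
let convolution = MPSCNNConvolutionNode(source: spectrumInput,
                                        weights: ToyWeights(kernelWidth: 3, kernelHeight: 3,
                                                            inputChannels: 1, outputChannels: 8))
let activation = MPSCNNNeuronReLUNode(source: convolution.resultImage)
let chordProbabilities = MPSCNNSoftMaxNode(source: activation.resultImage)

if let device = MTLCreateSystemDefaultDevice(),
   let graph = MPSNNGraph(device: device,
                          resultImage: chordProbabilities.resultImage,
                          resultImageIsNeeded: true) {
    // At this point, encoding the graph into a command buffer would run the
    // forward pass on the GPU.
    print(graph)
}
```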
This sounds rather straightforward, but in my case it turned out to be quite a bit more complicated to pull off. The simplified API had some missing pieces that forced me to replace parts of my “sketch” using lower-level APIs. I’ll spare you the details, but at times I found myself neck-deep in advanced calculus that I’d only briefly encountered in school. (Partial derivatives of equations involving matrices? Ugh!)
Once I finally got the neural network set up, training it was a matter of feeding it input data and labels. In my case, the input data consists of a spectral representation of the song’s audio (i.e. something spectrogram-like), and the labels contain the chords that we know are present in the spectral input.
Using this data, the neural network performs a forward pass and a backward pass. The forward pass consists of the part that ships with Capo—the song’s audio spectrum goes in, and the most likely chord comes out. In the backward pass, the result is compared against the label data (i.e. the correct chords) to perform adjustments to the neural network so that it gets a tiny bit better at its job.
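To make the forward/backward idea concrete, here’s a deliberately tiny toy in Swift. It has nothing to do with Capo’s actual network; it’s a single-weight “network” learning to double its input, but the rhythm is the same: predict, compare against the label, nudge the weights a little.

```swift
// Toy illustration of one forward/backward cycle, repeated over the data.
var weight: Float = 0.0
let learningRate: Float = 0.1

// Pretend training pairs: (input, correct answer). The "right" weight is 2.
let trainingData: [(input: Float, label: Float)] = [(1, 2), (2, 4), (3, 6)]

for _ in 0..<100 {                          // repeated passes over the data set
    for (input, label) in trainingData {
        let prediction = weight * input     // forward pass
        let error = prediction - label      // compare against the label
        let gradient = 2 * error * input    // backward pass: d(error²)/d(weight)
        weight -= learningRate * gradient   // tiny adjustment
    }
}

print(weight)   // creeps toward 2.0 as training proceeds
```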
Amazingly, this trip through the neural network involves millions of calculations. It all happens in the blink of an eye, and the process is repeated for every song in the training data set. The general idea here is that the neural network adjusts itself slowly with each song that it encounters, and the accuracy of its chord results improves gradually over time.
If you’re not already familiar with neural networks, I highly recommend that you check out Grant Sanderson’s video series on YouTube. I think he does an excellent job explaining the basics of how this all works.
In the last post, I spoke about the database files that get prepared before the training software runs. Without the tools that prepare the databases in advance, the GPU would spend most of its time sitting idle as it waits for the CPU to translate the audio and label files into a suitable format for the neural network.
While training a neural network, it’s common to send batches of training data rather than individual items. This practice can help the network learn more effectively, and it also improves performance by reducing the overhead of shuttling data between the CPU and the GPU.
To prepare a batch of data, the training software selects a number of songs at random from the training database and copies them into GPU-friendly data structures. This happens fairly quickly, but the GPU should not have to sit around waiting as its next batch of work is fetched from the database.
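Here’s a rough sketch of that batch-assembly step. It’s not Capo’s actual code—the `Song` type and its fields are assumptions—but it shows the general shape: pick songs at random and pack their spectra and labels into flat, GPU-friendly buffers.

```swift
// Illustrative only: one database entry pairs a spectrogram-like input with
// the chord labels known to be present in it.
struct Song {
    var spectrum: [Float]   // flattened time × frequency magnitudes
    var labels: [Int32]     // chord-class index per time step
}

// Assemble one batch by sampling songs at random from the prepared database.
func makeBatch(from database: [Song], batchSize: Int) -> (inputs: [Float], labels: [Int32]) {
    var inputs: [Float] = []
    var labels: [Int32] = []
    for _ in 0..<batchSize {
        guard let song = database.randomElement() else { break }
        inputs.append(contentsOf: song.spectrum)
        labels.append(contentsOf: song.labels)
    }
    return (inputs, labels)
}
```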
To avoid this, the training software begins by priming the pump at the start of each training epoch (i.e. one trip through the training data set). More specifically, a large, in-memory queue gets filled with batches of data from the database. Once this data pump is primed, a thread gets kicked off with the sole purpose of keeping the queue filled with data.
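The pump boils down to a bounded, thread-safe queue with a producer thread feeding it. The sketch below captures that idea in Swift—again, not Capo’s code; the capacity and the `makeNextBatch()` stub are stand-ins for the real database work.

```swift
import Foundation

// A bounded, thread-safe queue: the producer blocks when it's full, the
// consumer blocks when it's empty.
final class BatchQueue<Batch> {
    private var batches: [Batch] = []
    private let lock = NSLock()
    private let slotsFree: DispatchSemaphore                // producer waits when full
    private let itemsReady = DispatchSemaphore(value: 0)    // consumer waits when empty

    init(capacity: Int) {
        slotsFree = DispatchSemaphore(value: capacity)
    }

    func push(_ batch: Batch) {
        slotsFree.wait()
        lock.lock(); batches.append(batch); lock.unlock()
        itemsReady.signal()
    }

    func pop() -> Batch {
        itemsReady.wait()
        lock.lock(); let batch = batches.removeFirst(); lock.unlock()
        slotsFree.signal()
        return batch
    }
}

// Hypothetical stand-in for pulling random songs from the training database
// and packing them into GPU-friendly buffers.
func makeNextBatch() -> [Float]? { nil }

// Prime the pump, then keep it filled from a background thread for the epoch.
let queue = BatchQueue<[Float]>(capacity: 64)
DispatchQueue.global(qos: .userInitiated).async {
    while let batch = makeNextBatch() {
        queue.push(batch)
    }
}
```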
As training runs, batches of data are pulled from the front of the queue and submitted immediately to the GPU. I use a double-buffering scheme to ensure that the GPU never has to wait for its next batch—I’m preparing it while the GPU is working on the current one.
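The double buffering itself is essentially the standard Metal pattern of limiting how many command buffers are “in flight” with a semaphore. Here’s a stripped-down sketch of that pattern; the encoding step is elided, and the choice of two batches in flight is what gives the double buffering.

```swift
import Foundation
import Metal

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let inFlight = DispatchSemaphore(value: 2)   // at most two batches in flight

func submit(batch: [Float]) {
    inFlight.wait()                          // block if the GPU is two batches behind
    let commandBuffer = commandQueue.makeCommandBuffer()!
    // ... encode the network's forward/backward work for this batch here ...
    commandBuffer.addCompletedHandler { _ in
        inFlight.signal()                    // free a slot once the GPU finishes
    }
    commandBuffer.commit()
}
```

While the GPU chews on one committed command buffer, the CPU is already back in the loop preparing and encoding the next one.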
Using this scheme, I was able to keep GPU utilization in the high-90% range. I finally got to hear the fan in my iMac Pro!
Interestingly, there’s no such thing as a fully trained neural network. Instead, a neural network gets trained until some stopping condition is met.
In some cases, I might choose to stop training because the neural network isn’t getting any better at identifying chords in the validation data set (a.k.a. early stopping). In other cases, I just want to train for a fixed number of epochs.
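Early stopping can be as simple as watching the validation results and giving up after they stop improving for a while. The sketch below is one common way to express that; the names and the “patience” of five epochs are illustrative, not taken from Capo’s training software.

```swift
// Stop when validation accuracy hasn't improved for `patience` epochs in a row.
var bestAccuracy = 0.0
var epochsWithoutImprovement = 0
let patience = 5

func shouldStopAfterEpoch(validationAccuracy: Double) -> Bool {
    if validationAccuracy > bestAccuracy {
        bestAccuracy = validationAccuracy
        epochsWithoutImprovement = 0
    } else {
        epochsWithoutImprovement += 1
    }
    return epochsWithoutImprovement >= patience
}
```

After each epoch, the accuracy measured on the validation set gets fed into a check like this, and the epoch loop ends when it returns true.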
But how do I decide which of these strategies to take? How do I quantify that the neural network “isn’t getting any better”?
In the next post, I’m going to talk about the high-level driver in the training software that controls the training process.