Chord Intelligence mkIV: The Driver
Chris Liscio
There is a high-level component in my training software that is responsible for executing each of the experiments that I need to perform in my chord detection research. I call it the driver, because it drives the training process on my behalf.
Every time I run the full training process, I generate an entirely new chord detector. I can create different chord detectors by changing the training data set, trying a different neural network architecture, or by supplying different parameters that adjust how the training is executed.
Each of these training runs represents a single experiment that requires further evaluation. My aim is to develop an understanding of how each variable (the data set, the architecture, and parameters) affects the outcome (the chord detector.)
It takes a *very long time* to train a new chord detector, and the process consumes an awful lot of (electrical) energy. So it’s important that each of my experiments is executed carefully, and no effort is wasted.
That’s what makes the driver such an important component of the training software. It turns my recipe for a given experiment into a collection of useful data (including the chord detector itself) that I can use to make informed decisions about how I should proceed.
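To make the idea of a “recipe” a little more concrete, here’s a rough sketch of the kind of settings that get bundled together for one experiment. The field names here are purely illustrative, not the ones from my actual training software:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentRecipe:
    """Everything the driver needs to run a single training experiment (illustrative only)."""
    dataset_path: str                          # which training data set to use
    architecture: str                          # which neural network architecture to build
    optimizer_name: str = "adam"               # which optimization algorithm to use
    optimizer_params: dict = field(default_factory=lambda: {"lr": 1e-3})
    stopping_strategy: str = "early_stopping"  # or "epoch_count"
    stopping_params: dict = field(default_factory=lambda: {"patience": 5})
    seed: int = 42                             # drives all pseudo-random number generation
    output_dir: str = "runs/experiment-001"    # where checkpoints and statistics are kept
```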
Specifically, the driver does the following for me.
- It configures the neural network’s optimizer, which is responsible for making changes to the network based on the incoming training data.
- It decides when to stop training, so that we don’t train for longer than is necessary.
- It retains as much data as possible for later analysis.
- It ensures that my experiments are repeatable by controlling the generation of (pseudo-)random numbers.
The rest of this post will explain each of these responsibilities in depth. So strap in, because there’s a lot to cover!
In the last post I wrote about the forward and backward passes through the neural network. There, I made it seem as though the neural network was capable of learning from the training data on its own, which is not the case.
A neural network consists of a complex arrangement of rather simple mathematical components—really, just a massive collection of multiplications and additions performed on the network’s input and a collection of numbers that represent what the network can do. These numbers are also known as the network’s weights and biases, and changes to these values will influence the output of the neural network for a given input.
Since we’re talking about a neural network that detects chords, our input is a slice of musical audio, and the output can be interpreted as `C:maj`. If changes are made to the weights and biases of our neural network, the same slice of audio might instead produce an `F:min7`.
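If that sounds abstract, a single layer of such a network really is just a matrix of weights, a vector of biases, and a squashing function. A toy sketch in Python (not my actual network code, and a real chord detector stacks many such layers):

```python
import numpy as np

def tiny_layer(x, weights, biases):
    """One layer of a network: multiply by the weights, add the biases, squash the result."""
    return np.maximum(0.0, weights @ x + biases)   # ReLU "squashing" as an example activation
```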
A neural network on its own is incapable of changing its weights and biases, and hence it cannot really learn anything on its own. Instead, we rely on an optimization algorithm to make changes to the neural network’s weights & biases based on the labels (i.e. the correct chords) that we supply during training.
There are a number of different optimization algorithms available such as ADAM, SGD (Stochastic Gradient Descent), and many variations of the same. Each algorithm has its own set of parameters that controls how learning should progress. Simplifying greatly, these parameters control how drastically the network will be adjusted in response to a new training example.
Interestingly, a poorly chosen algorithm and parameters can result in a chord detector that doesn’t appear to work at all. For example, I could specify values that result in the network failing to reach 20% accuracy after a day of training.
So what does all of this business about optimization have to do with the driver?
Not much, really! What’s important to understand here is that the driver allows me to specify the optimization algorithm and its associated parameters (a.k.a. hyperparameters) so that I can measure their significance later. I’ll speak more about hyperparameters in the future when I discuss evaluation.
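To give a flavour of what “specifying the optimization algorithm and its parameters” means in practice, here’s a minimal PyTorch-style sketch. The function name is made up, and my driver doesn’t necessarily look like this:

```python
import torch

def build_optimizer(model, name, **hyperparams):
    """Construct the requested optimization algorithm for the network's weights and biases."""
    if name == "adam":
        # e.g. hyperparams = {"lr": 1e-3, "betas": (0.9, 0.999)}
        return torch.optim.Adam(model.parameters(), **hyperparams)
    if name == "sgd":
        # e.g. hyperparams = {"lr": 0.01, "momentum": 0.9}
        return torch.optim.SGD(model.parameters(), **hyperparams)
    raise ValueError(f"unknown optimizer: {name}")
```

The values passed in here (learning rate, momentum, and so on) are exactly the kind of hyperparameters whose significance I want to measure later.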
Just like with the optimization algorithms, there are also algorithms that define what I’ll call the training strategy. I can specify which training strategy I want the driver to execute, along with the parameters for my chosen strategy that define when training should stop.
Right now, the driver gives me two strategies to choose from.
The first strategy just counts epochs (i.e. a trip through the training data set.) When I specify this strategy to my training software, there is no need for separate training and validation data sets. The neural network just runs an epoch, then another epoch, and so on until the specified number of epochs has passed. Easy!
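In sketch form (PyTorch-flavoured, and not my actual driver code), the epoch-counting strategy is about as simple as a training loop gets:

```python
def run_one_epoch(model, optimizer, loss_fn, train_loader):
    """One trip through the training data set."""
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)  # forward pass
        loss.backward()                        # backward pass
        optimizer.step()                       # optimizer nudges the weights and biases

def train_for_epochs(model, optimizer, loss_fn, train_loader, num_epochs):
    """Strategy 1: run a fixed number of epochs, then stop. No validation set required."""
    for _ in range(num_epochs):
        run_one_epoch(model, optimizer, loss_fn, train_loader)
```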
Unfortunately, epoch counting alone is like flying blind. If you allow the network to train for too long, you run the risk of overfitting your model to your training data. To relate this to the chord detector, imagine a chord detector that was 99.9% accurate at recognizing chords in 50 of my favorite songs, but completely unable to recognize a single chord in your favorites. We don’t want that!
The second strategy—early stopping—is more complex. It calculates accuracy along the way, and makes an informed decision about whether or not training should proceed.
Here, the training data set is used for training, and accuracy is measured against the validation data set. If accuracy does not improve, training stops. Remember: the validation data set represents items (i.e. songs) that the neural network has not seen during training.
At first, this sounds obvious—possibly even a little silly. Of course training should stop if accuracy isn’t getting better! But you might wonder, “what’s the harm if I just let training run for a few days?”
Interestingly, what will likely happen is that the accuracy on the validation data set will reach a peak, and then start to fall. This is a sign of overfitting—the neural network will continue to get very good at detecting chords in the training data set as it gets worse on the validation set.
The early stopping strategy has one or more parameters, depending on the variation. One parameter controls how many epochs we’ll continue training after we first detect that the validation data set accuracy has not improved. In another variation, a second parameter will control how many times we’ll allow that redemption process to repeat. In yet another variation, we might also choose to adjust the optimization algorithm to help break through any plateaus in learning.
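Sketched out in the same style, the simplest variation (a single “patience” parameter, reusing the run_one_epoch helper from the earlier sketch) might look roughly like this:

```python
import copy
import torch

def evaluate_accuracy(model, val_loader):
    """How often the network guesses the correct chord on songs it hasn't seen."""
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            predictions = model(inputs).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.numel()
    return correct / total

def train_with_early_stopping(model, optimizer, loss_fn, train_loader, val_loader,
                              patience=5, max_epochs=200):
    """Strategy 2: stop once validation accuracy stops improving for `patience` epochs."""
    best_accuracy = 0.0
    best_weights = copy.deepcopy(model.state_dict())  # fall back to the initial weights
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        run_one_epoch(model, optimizer, loss_fn, train_loader)  # same loop as strategy 1
        accuracy = evaluate_accuracy(model, val_loader)
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_weights = copy.deepcopy(model.state_dict())    # the "best so far" snapshot
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                           # validation accuracy has plateaued
    model.load_state_dict(best_weights)
    return best_accuracy
```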
As training progresses, the driver collects the following data for me:
- The network’s accuracy. This is a rather simple high-level statistic about the neural network’s performance on the training (and, if it’s used, the validation) data set(s): how often does the neural network guess the correct chord?
- The neural network’s weights and biases. At every epoch, the neural network’s weights and biases get dumped to a folder (see the sketch after this list). This allows me to go back and review the training progress, and study how the network learned over time.
- The “best so far” weights and biases. When the early stopping strategy is used, training will continue for a few epochs after the best-performing set of weights and biases has been found. This folder of values is used in the neural network that ships in the final product.
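In PyTorch terms, that bookkeeping can be as simple as dumping the model’s state dictionary into the run’s folder after every epoch. Again, a sketch rather than my actual code:

```python
import os
import torch

def save_checkpoint(model, output_dir, epoch, is_best=False):
    """Dump the network's weights and biases so the run can be studied later."""
    os.makedirs(output_dir, exist_ok=True)
    torch.save(model.state_dict(), os.path.join(output_dir, f"epoch-{epoch:04d}.pt"))
    if is_best:
        # The "best so far" snapshot is the one that would ship in the final product.
        torch.save(model.state_dict(), os.path.join(output_dir, "best.pt"))
```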
This bookkeeping that’s performed by the driver is an essential service that allows me to extract value from every training run—especially the ones that fail. These records allow me to identify which avenues I should pursue further, and those that I can safely ignore.
If I can’t repeat an experiment and get the same result twice, then I can’t really base any decisions on its outcome. Unfortunately for me, there is a lot of randomness sprinkled around in the training of a neural network, concentrated in the following two areas.
1. Training data is selected at random on every epoch. If I feed a sufficiently complicated neural network the same sequence of songs over and over again, it can start to learn that sequence! Put another way, if I trained the network on one sequence of songs until it reached 90% accuracy, then tested its accuracy against the same set of songs shuffled into a different order, it may not score so high!
2. Weights and biases get initialized with (millions of) random values. This is done for a number of reasons, but the most interesting one is that random initialization can significantly speed up the learning process.
If I want to compare two different optimization algorithms, or measure the effect of any hyperparameter for that matter, I need to ensure that nothing else has changed between the two experiments.
To control for this randomness, I had to be very careful to use the same pseudorandom number generator in all the places where I needed random values. For any given seed value that I specify to the driver software, the generator will spit out the same sequence of numbers every time.
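In a Python-based setup, this boils down to seeding every pseudorandom number generator in play from the single seed value handed to the driver. A sketch, assuming NumPy and PyTorch are the sources of randomness:

```python
import random
import numpy as np
import torch

def seed_everything(seed):
    """Make a training run repeatable by seeding every PRNG from one value."""
    random.seed(seed)        # Python's generator (e.g. shuffling the song list)
    np.random.seed(seed)     # NumPy (e.g. data preparation)
    torch.manual_seed(seed)  # weight & bias initialization, data loader shuffling
```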
By this point, I hope the importance of the driver is clear. It gives me flexibility in the various hyperparameters that I can control, and I need to trust that it executes each of my experiments reliably.
But remember—it takes a very long time to train a chord detector. Each of my experiments is an hours-long ordeal that results in a new chord detector—each one requiring further evaluation!
In the next post, I’m going to talk about how I executed the grid search to find a good set of hyperparameters, and why I felt compelled to acquire a pair of eGPUs to help me get results faster.