Chord Intelligence mkIV: Preparing for Training

To train Capo’s chord detector efficiently, the collection of songs and chord labels in our training data set must first be processed by a few tools.

These tools are necessary to help expand the data set further, and to prepare all the data for the training software that runs on the GPU. They also help me to evaluate different neural network architectures in my ongoing research.

Generating More Data

In the world of machine learning, practitioners employ a technique known as data augmentation to further expand the size of their training data set. In my case, I apply adjustments to the collection of audio files and chord labels to create a much larger set of songs.

For example, when I transpose the audio of a song I get a new song in a different key. When I slightly de-tune a song, I get a new song that mimics an older tape recording. These are just some of the options available, and each new idea (chosen carefully!) multiplies the size of my training data set.

To expand my data set for Capo’s chord detector, I created a data augmentation tool that runs through all my training data and spits out new audio files with a mixture of transformations applied. In some cases, labels also need adjustment to account for those songs that have changed keys, and managing all this turned out to be fairly tricky.
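To make the label adjustment concrete, here is a minimal sketch of what transposing a chord label could look like. It is not Capo's actual augmentation code: the function name, the sharps-only note spelling, and the "N" no-chord marker are all assumptions made for illustration, with the label format borrowed from the C:min example later in this post.

```python
# Hypothetical helper: shift the root of a "Root:quality" chord label by a
# number of semitones. Sharps-only spelling is an assumption made for brevity.
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
FLATS = {"Db": "C#", "Eb": "D#", "Gb": "F#", "Ab": "G#", "Bb": "A#"}


def transpose_label(label: str, semitones: int) -> str:
    """Return the label whose root is `semitones` above (or below) the original."""
    if label == "N":  # assumed "no chord" marker; there is no root to move
        return label
    root, sep, quality = label.partition(":")
    root = FLATS.get(root, root)  # normalize flat spellings to sharps
    new_root = NOTES[(NOTES.index(root) + semitones) % 12]
    return new_root + sep + quality  # the chord quality is unchanged


# A song transposed up a whole step needs every label moved up two semitones.
original = ["C:min", "G:maj", "N"]
transposed = [transpose_label(label, 2) for label in original]
```

A de-tuned song, by contrast, would keep its labels as they are, since the chords stay the same and only the tuning drifts.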

After running augmentation, I ended up with a very large training data set equivalent to more than 36 days (over a month) of audio. Amazingly, the tool that trains the deep neural network on the GPU can chew through the entirety of that audio in a little more than an hour, but not without some serious effort on my part.

Preparing for the GPU

The next tool that I run on the training data transforms the audio files and labels into a nearly GPU-ready format, and stores them in a (very large!) SQLite database. This is an important step because it takes more time to prepare a song’s audio file and chord labels for the neural network than it does to execute the training of the neural network on the GPU. Without this tool, the training software would spend most of its time preparing data while the GPU sits idle waiting for its next batch of work.

For each of the songs in the training data set, the tool:

Decompresses the audio (from .mp3 or .m4a), if needed
Transforms it into an intermediate representation (a spectrogram, MFCCs, etc.)
Stores each time slice of the spectral data alongside its textual chord label (e.g. C:min) in the database
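The last of those steps can be sketched with Python's built-in sqlite3 module. The table name, column names, and 32-bit float encoding below are assumptions chosen for illustration; the post does not describe the real schema.

```python
import sqlite3
from array import array


def store_song(db, song_id, frames, labels):
    """Store one row per time slice: spectral values as a BLOB plus a chord label."""
    # Hypothetical schema: one record per (song, time slice).
    db.execute(
        "CREATE TABLE IF NOT EXISTS slices ("
        " song_id TEXT, frame_index INTEGER, features BLOB, label TEXT,"
        " PRIMARY KEY (song_id, frame_index))"
    )
    rows = [
        (song_id, i, array("f", frame).tobytes(), label)  # pack as 32-bit floats
        for i, (frame, label) in enumerate(zip(frames, labels))
    ]
    with db:  # commit all of the song's rows in a single transaction
        db.executemany("INSERT INTO slices VALUES (?, ?, ?, ?)", rows)


# Two tiny made-up "spectral" slices, both labeled C:min.
db = sqlite3.connect(":memory:")
store_song(db, "song-001", [[0.1, 0.2], [0.3, 0.4]], ["C:min", "C:min"])
```

Keeping the label as text rather than as a network-specific number leaves the database independent of any one chord vocabulary.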

The records in the database are arranged in a way that gives me some flexibility in the way that I choose to fetch and interpret the spectral song data later. That flexibility lets me quickly evaluate different neural network architectures, different chord vocabularies, and different segmentations of training vs validation data.

For example, one neural network might require a small number of time slices to detect one chord at a time, while another can work with longer stretches of time to output a sequence of detected chords. Both arrangements can be supported through the use of carefully crafted database queries.
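As a rough illustration of those two access patterns, the sketch below runs both kinds of query against a tiny in-memory table. The schema and function names are hypothetical, and the spectral data column is left out to keep the example short.

```python
import sqlite3

# Hypothetical table: one row per time slice (spectral data omitted here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE slices (song_id TEXT, frame_index INTEGER, label TEXT)")
db.executemany(
    "INSERT INTO slices VALUES (?, ?, ?)",
    [("song-001", i, "C:maj" if i < 3 else "G:maj") for i in range(6)],
)


def context_window(db, song_id, center, radius):
    """A few slices around one frame, for a network that detects one chord at a time."""
    return db.execute(
        "SELECT frame_index, label FROM slices"
        " WHERE song_id = ? AND frame_index BETWEEN ? AND ?"
        " ORDER BY frame_index",
        (song_id, center - radius, center + radius),
    ).fetchall()


def sequence(db, song_id, start, length):
    """A longer run of slices, for a network that outputs a sequence of chords."""
    return db.execute(
        "SELECT frame_index, label FROM slices"
        " WHERE song_id = ? AND frame_index >= ?"
        " ORDER BY frame_index LIMIT ?",
        (song_id, start, length),
    ).fetchall()
```

The same rows serve both networks; only the query changes.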

Segmenting

The final tool that prepares data for training serves two purposes. First, it splits the training data into a training set and a validation set. Second, it applies a transformation to the stored labels to make them suitable for the neural network that I’m evaluating.

By segmenting the data in advance, I can be sure that all my experiments use the same sets of training and validation data. This helps me get repeatable results so I can measure the effect of different hyperparameters.
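One simple way to get a split that never changes between runs is to derive it from a hash of each song's identifier. This is a sketch of that general technique, not a description of how the segmenting tool actually works; the function name and the 20% validation fraction are assumptions.

```python
import hashlib


def assign_split(song_id: str, validation_fraction: float = 0.2) -> str:
    """Deterministically place a song in the training or validation set."""
    digest = hashlib.sha256(song_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # stable value in [0, 1)
    return "validation" if bucket < validation_fraction else "training"


# The same identifier always lands in the same set, on any machine.
splits = {song: assign_split(song) for song in ["song-001", "song-002", "song-003"]}
```

If augmented copies of a song were keyed by the original song's identifier, they would all fall on the same side of the split, which keeps a transposed copy of a training song out of the validation set.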

The transformation is a further optimization step that gets the data even more GPU-ready, but it is also tuned to a specific output labeling of the neural network being trained. For example, if I were training a “major vs minor” chord detector I would apply some logic that transformed C:maj7 chords into C chords, changed F:min9 into F:min, and left G:min alone.
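The "major vs minor" mapping described above might be sketched like this. The function is hypothetical, and treating every non-minor quality as major is a deliberate oversimplification: a real vocabulary would need explicit rules for diminished, augmented, suspended, and other chord types.

```python
def to_major_minor(label: str) -> str:
    """Collapse a "Root:quality" label into a major-vs-minor vocabulary."""
    root, sep, quality = label.partition(":")
    if not sep:  # bare root such as "C" (or an assumed "N" no-chord marker)
        return label
    # Assumption: any quality beginning with "min" counts as minor and
    # everything else collapses to the bare root, read as a major chord.
    return f"{root}:min" if quality.startswith("min") else root


# The three cases from the paragraph above.
reduced = [to_major_minor(label) for label in ["C:maj7", "F:min9", "G:min"]]
```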

In certain situations, I need to run the segmenting tool multiple times in order to generate a collection of training and validation sets. For example, I do this when I need to run a k-fold cross-validation to evaluate how well the neural network performs with new songs that it has never "heard" before. I plan to share more about how I ran these experiments in an upcoming post.
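For readers unfamiliar with k-fold cross-validation, here is a minimal sketch of how a set of songs could be divided so that each one is held out exactly once. The function name and the round-robin fold assignment are assumptions, not the segmenting tool's actual logic.

```python
def k_fold_splits(song_ids, k):
    """Yield (training, validation) lists; each song is held out exactly once."""
    songs = sorted(song_ids)  # a fixed order keeps the folds repeatable
    folds = [songs[i::k] for i in range(k)]  # deal songs out round-robin
    for i, held_out in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield training, held_out


# Five made-up songs split into two folds.
splits = list(k_fold_splits(["a", "b", "c", "d", "e"], k=2))
```

Splitting by song rather than by time slice matters here: it ensures the validation songs are ones the network has never encountered during training.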

Ready To Go

After running all the above tools, I end up with a collection of databases that are ready for my GPU-powered training software. All it needs to do now is pull the records from disk and feed them to the GPU to train the deep neural network, which is easier said than done.

In the next post, I’ll talk about the tool that helped me heat my office last winter as it (repeatedly) trained Capo’s new chord detector.