Chris Liscio
We just shipped Capo touch and Capo 3.1 last week—on the same day!—and a large part of the time on this project was spent building Chord Intelligence.
I feel that the name is fitting; Chord Intelligence is trained from a collection of music and applies its knowledge to what it hears in an audio recording. You read that correctly: Capo's new chord detection engine is trained, and it does indeed learn.
Early in Capo's life, I built a "chord marker" feature that allowed the user to place a marker in the song and attach a chord name to it. You could enter simple strings such as 'bbmaj7', 'f#9', 'gdim', etc. and Capo spit out the correct chord.
At the time I thought, "Wouldn't it be awesome if the user went to drop the chord marker, and the chord was already filled in?" This is when I took a long trip down the rabbit hole.
I built my first chord detection engine in late 2009 / early 2010. It was loosely based on a paper by Adam Stark and Mark D. Plumbley, and Adam was a great help as I was trying to understand the finer points of his paper.
This detection engine worked quite well, but users were bombarding me with requests to run the entire detection process automatically. Having dug into the research I knew that this was a much more challenging prospect.
Recorded music is very complex, and it gains further complexity as you add players to a recording. The complexity can be from a spectral standpoint (a snare drum hit can excite the whole frequency range), a harmonic standpoint (a distorted guitar could be missing its fundamental frequency), and a musical standpoint (a guitar playing a chord while a singer accompanies.)
The last point of complexity is the easiest for musicians to understand. If you played a C major chord, and your singer were to only sing arpeggios containing the C, E, and G notes, then the recording of your (admittedly quite boring) performance would contain just a C major chord.
The real world doesn't work this way. You could keep strumming a C chord and your vocalist could be singing anywhere in the C major scale (let's keep this simple, music majors!). Each note that is not a C, E, or G will change the overall chord in the recording. If your singer hits a B, it's a CMaj7 chord, and if your singer hits a D then it's a Cadd9.
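To make the singer-plus-guitarist example concrete, here is a minimal Python sketch that names a chord from the notes sounding over a C root. The `NOTE_TO_PC` and `CHORD_SHAPES` tables and the `name_chord` helper are made up for illustration and cover only the three chords mentioned above; this is not how Capo names chords.

```python
# Pitch classes numbered 0-11 starting at C (illustrative convention).
NOTE_TO_PC = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

# A tiny, hypothetical lookup of interval sets (semitones above the root).
CHORD_SHAPES = {
    frozenset({0, 4, 7}): "",          # major triad
    frozenset({0, 4, 7, 11}): "Maj7",  # major triad plus the major seventh
    frozenset({0, 2, 4, 7}): "add9",   # major triad plus the ninth
}

def name_chord(root, notes):
    """Return a chord name for the notes heard over the root, or None."""
    root_pc = NOTE_TO_PC[root]
    intervals = frozenset((NOTE_TO_PC[n] - root_pc) % 12 for n in notes)
    suffix = CHORD_SHAPES.get(intervals)
    return None if suffix is None else root + suffix

guitar = ["C", "E", "G"]
print(name_chord("C", guitar))          # guitar alone
print(name_chord("C", guitar + ["B"]))  # singer holds a B
print(name_chord("C", guitar + ["D"]))  # singer holds a D
```

The guitar part never changes, yet the name of what is heard does, which is exactly the problem an automatic detector faces.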
When I built an early version of an automatic detector, I decided to let my original engine loose to see just how bad this challenge would be. How much extra data are we talking, here?
Capo would drop anywhere from 300-400 chord changes on a typical pop song. It would take the user far longer to clean out all the invalid chord changes than it would take them to just drop the chords manually as they did in Capo 2.0.
My approach in Capo 3.0 was to focus on cleaning up the messiness in the spectrum (through various forms of averaging and clever filtering) and then further averaging data over the course of a bar.
Because I had integrated a beat detection engine, I had the benefit of knowing where the bars started, and where the beats were located. I attempted filtering at different levels ranging from each quarter note up to the entire bar.
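The averaging idea can be sketched in a few lines of Python. This assumes the audio has already been reduced to 12-bin "chroma" frames (one energy value per pitch class) and that the beat detector supplies frame indices for the span boundaries; the `average_by_span` and `chroma` names and the toy data are hypothetical, not Capo's actual code.

```python
def average_by_span(frames, boundaries):
    """Average 12-bin chroma frames between consecutive boundary indices.

    frames: list of 12-element lists, one per analysis frame.
    boundaries: sorted frame indices where each span (beat or bar) starts,
                plus a final entry marking the end of the last span.
    """
    averaged = []
    for start, end in zip(boundaries, boundaries[1:]):
        span = frames[start:end]
        if not span:
            continue  # skip empty spans rather than divide by zero
        averaged.append([sum(f[i] for f in span) / len(span) for i in range(12)])
    return averaged

def chroma(*pitch_classes):
    """Build a toy chroma frame with full energy on the given pitch classes."""
    frame = [0.0] * 12
    for pc in pitch_classes:
        frame[pc] = 1.0
    return frame

# One bar of four frames: the guitar holds C-E-G (0, 4, 7) while the
# singer moves from B (11) to D (2) halfway through.
bar = [chroma(0, 4, 7, 11), chroma(0, 4, 7, 11),
       chroma(0, 4, 7, 2), chroma(0, 4, 7, 2)]

per_bar = average_by_span(bar, [0, 4])       # one average for the whole bar
per_half = average_by_span(bar, [0, 2, 4])   # one average per half bar

print(per_bar[0])
print(per_half)
```

Averaging over the whole bar keeps the C, E, and G at full strength but leaves the B and the D both at half strength, while the half-bar averages keep them apart. That is the trade-off between the different filtering levels.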
After extensive testing, I determined that the bar averaging was a good start for users and they could use the existing "drop chord" feature to fill in the changes that happened within the bar. Unfortunately this still wasn't good enough.
This approach suffered from a few flaws. When averaging over the course of a bar, the "vocalist and guitarist" problem I discussed earlier could be amplified if the audio mix did not favor the guitarist as much as the singer. In the earlier example, you'd have the information containing both the B (CMaj7) and the D (Cadd9), and the result would be a muddled mess.
Furthermore, when the beat detector didn't nail down the downbeat location, you would have a smearing effect that occurred when combining two halves of two different bars (so you could now mash a C and F together, combined with singing.)
It was a naïve approach, but I found that it seemed to work more often than it didn't—especially after my greatly improved filtering approach. Besides, the point of Capo is that you're already committed to learning songs by ear! Even when it was often wrong, the user had a huge head start on learning a track.
Capo 3's launch was a huge success for us, and the obvious next step was to build Capo touch. Our users were banging down our door looking for something more mobile, and I wanted to try and give them something amazing.
When we exhibited at NAMM 2014, the response was incredible. Our booth was frequented by a large collection of top-notch musicians, and we have the video to prove it. Being the smart developer I am, I was sure to bring a prototype of Capo touch to the show.
The initial response was amazing, but as I gave demo after demo, it became clear that a mobile version of our software would not be so forgiving when the chords needed excessive massaging as they did on the desktop. When you're armed with a keyboard and mouse, the prospect of editing a few dozen chords isn't so daunting. When it's just your phone and your finger, this is certainly not the case.
When I returned from NAMM and came down from all the excitement of the show, I decided that enough was enough and I needed to tackle all the research I was uncomfortable with and afraid of. I had never taken any Artificial Intelligence or higher-level Statistics courses at school, but all the research papers I had been reading over the years made frequent references to concepts that I was completely unfamiliar with.
I re-read all the papers I had used over the years for reference, and then read them again. I got in touch with Taemin Cho to get some clarification on some of his work, and he led me to newer papers which required additional learning on my part. For a solid six weeks I was doing nothing but reading papers and exploring in MATLAB.
I resolved to not just build a new chord detection engine for Capo, but to build an entire chord detection engine factory. Armed with my copies of MATLAB and Xcode, and an unwillingness to fail, I set forth on my quest.
During my research work, I managed to build a great deal of tooling. In MATLAB, I built up the training environment which spits out statistical models (i.e. the "intelligence") that get used by Capo. On the Xcode side, I extracted all the audio processing and analysis code that would spit out the data used by MATLAB during the training process.
Of course, I also had to build the Chord Intelligence engine itself, which could take the statistical models and use them to make intelligent decisions about what chords are in the song, and how likely it is to change from one chord to another.
After another few weeks, I felt like I was getting somewhere. The initial tests of Capo running the Chord Intelligence engine had me stunned. A special debug build of Capo would play the chords along with the original recording, and it sounded like my computer was accompanying the music.
You'd think I was done there. I had a working chord detection engine, it was integrated into Capo on my Mac, and it was surprisingly accurate.
The unfortunate side-effect of using all this sophisticated math in my code was that it was very memory-hungry and CPU-intensive. Half of the battle is figuring out what all the research means and implementing a working version of it. 90% of the battle is making the code run in a reasonable amount of time!
The first working version of Chord Intelligence in Capo 3 took approximately 120s to process a standard 3 minute pop song. On my new Mac Pro. With 8 cores. And 32GB RAM. Crap.
Of course, I stacked the deck against myself, as I usually do. My first crack at these things doesn't often happen with threading (i.e. processing on multiple CPU cores), because that's really stupid when you're trying to figure this stuff out. I also don't take great measures to optimize memory allocations and re-use working memory buffers and so on. Again, it's stupid—premature optimization truly is the root of all evil, especially when you barely understand the math behind what you're building.
My optimization passes began, using Instruments along the way. I parallelized things, and tried to eliminate unnecessary calculations. 120s became 1.2s and I was in good shape. Or so I thought.
It was about this time in the process where I needed to start ensuring that this code would integrate with the iPhone builds of Capo touch, since that was where the code was originally intended to be used. (Side-note: Capo 3.1 wasn't originally in the plans to be launched concurrently!)
9 minutes. It took nine minutes to run the same chord detection processing on my 5th generation iPod touch. I had just spent the better part of a week optimizing the crap out of this code, but apparently I was just beginning.
All told, I think the final builds of Capo touch take approximately 40 seconds to do the same processing on the iPod touch, but I take drastic measures to make sure that the rest of the interface is usable while it's calculating this stuff in the background. Trust me: there is some serious stuff being figured out here behind the scenes.
Oh, and on that Mac Pro? I think it's down to about 400ms or so. Not a bad place to stop for now, without going too far down the path of making my code completely unreadable.
It's very hard to describe how Chord Intelligence works without getting too deep and technical. I've attempted it a few times, and the best I can do is use some analogies and hand-waving.
With enough musical recordings and annotated data (i.e. the chords in the song), you can form some statistical "pictures" of what chords sound like. Imagine going into a music collection and saying, "give me all the C chords!" You'd have a collection of short snippets of audio—some with just a guitar, some a piano, some that have a singer following along, and so forth.
If you do that enough times for enough chords, you now have a rather broad sampling of what every class of chord sounds like regardless of musical styles. Of course, depending on your sample set, you might come up short with certain chords. This is why the current version of Chord Intelligence does not automatically detect diminished chords. There just wasn't enough data in my collection to form a clear enough picture of what it sounds like in real recordings!
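As a rough illustration of forming these "pictures", the Python sketch below averages the chroma vectors of every annotated snippet that shares a chord label, and drops labels with too few examples. The per-class average is the simplest stand-in I can think of for a statistical model; the real models are certainly richer, and `train_templates`, `min_examples`, and the sample data are all made up for the example.

```python
from collections import defaultdict

def train_templates(examples, min_examples=2):
    """Average the chroma vectors collected for each chord label.

    examples: iterable of (label, chroma) pairs, chroma being 12 floats.
    Labels with fewer than min_examples snippets are dropped, because a
    handful of samples gives an unreliable picture of the chord.
    """
    grouped = defaultdict(list)
    for label, vec in examples:
        grouped[label].append(vec)

    templates = {}
    for label, vecs in grouped.items():
        if len(vecs) < min_examples:
            continue  # not enough data to trust this class
        templates[label] = [sum(v[i] for v in vecs) / len(vecs) for i in range(12)]
    return templates

# Made-up training snippets: two C chords (one with a singer on B) and a
# single diminished chord.
examples = [
    ("C",    [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]),
    ("C",    [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]),
    ("Bdim", [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]),
]

templates = train_templates(examples)
print(sorted(templates))  # chord classes that had enough data
print(templates["C"])
```

With only one diminished snippet in this toy collection, no diminished template is produced, which mirrors the sample-set gap described above.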
So some statistical data is generated by my training environment to get an idea of what chords sound like, but how do we cope with the problem of having too much information?
Now we go through the same collection of musical data and figure out how many times the music moved from any one chord to any other chord. How frequently did a C go to a C#min? From a C to a G? What about Asus4 to a G#min7?
With this information, you've managed to build up a body of knowledge about how music tends to "move" in practice.
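Counting chord movements is simple to sketch. The Python below tallies every adjacent pair of chords across a set of annotated songs and turns the tallies into probabilities; the function names and the two toy "songs" are invented for illustration.

```python
from collections import Counter, defaultdict

def count_transitions(songs):
    """Count how often each chord is followed by each other chord."""
    counts = defaultdict(Counter)
    for chords in songs:
        for prev, nxt in zip(chords, chords[1:]):
            counts[prev][nxt] += 1
    return counts

def to_probabilities(counts):
    """Normalize each row of counts so that it sums to 1."""
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        probs[prev] = {nxt: n / total for nxt, n in row.items()}
    return probs

# Two made-up annotated songs.
songs = [
    ["C", "G", "Am", "F", "C", "G"],
    ["C", "F", "G", "C"],
]

counts = count_transitions(songs)
probs = to_probabilities(counts)

print(dict(counts["C"]))
print(probs["C"])
print(probs["C"].get("C#min", 0.0))  # a movement never seen in training
```

A movement that never appears in the training data ends up with a probability of zero, so anything that consumes this table needs to decide how to treat unseen transitions.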
When faced with an audio recording, Capo will look through the entire song and effectively compare it to its database of "pictures" of every chord. For each chunk of audio, the match strength to every class of chord is calculated.
Now this output data is analyzed once again with the "movement information", and Chord Intelligence kicks in to filter out very unlikely movements. If a guitar solo pulls a chord out of line briefly, Chord Intelligence will likely filter that chord change out.
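One standard way to combine per-chunk match strengths with movement probabilities is Viterbi decoding over a hidden Markov model. That choice is my assumption for this sketch; nothing above confirms it is the algorithm inside Chord Intelligence. The `viterbi` function, the `floor` value used for unseen movements, and the toy numbers are all illustrative.

```python
import math

def viterbi(match, trans, chords, floor=1e-9):
    """Most likely chord sequence given per-chunk match strengths.

    match: list of dicts, chord -> positive match strength, one per chunk.
    trans: trans[a][b] is the probability of moving from chord a to chord b.
    floor: stand-in probability for movements never seen in training.
    """
    def log(p):
        return math.log(max(p, floor))

    score = {c: log(match[0][c]) for c in chords}
    back = []
    for frame in match[1:]:
        new_score, pointers = {}, {}
        for c in chords:
            # Pick the best chord to have arrived from.
            prev = max(chords, key=lambda p: score[p] + log(trans[p].get(c, 0.0)))
            new_score[c] = score[prev] + log(trans[prev].get(c, 0.0)) + log(frame[c])
            pointers[c] = prev
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final chord.
    path = [max(chords, key=lambda c: score[c])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

chords = ["C", "G"]
trans = {"C": {"C": 0.9, "G": 0.1}, "G": {"C": 0.1, "G": 0.9}}

# Five chunks of audio; a solo briefly makes G look stronger in chunk 3.
match = [
    {"C": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
    {"C": 0.4, "G": 0.6},
    {"C": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]

raw = [max(frame, key=frame.get) for frame in match]
smoothed = viterbi(match, trans, chords)
print(raw)
print(smoothed)
```

Picking the best match chunk by chunk reports a one-chunk jump to G and back, while the decoded path stays on C because two unlikely movements in a row cost more than one slightly weaker match.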
Of course, this approach can never be truly perfect if your goal is to figure out only what the pianist or guitarist is playing. Heck, there might even be two of each!
But sometimes Capo "fails", and in many cases the reported "failure" is subjective.
Let's go back to the singer and guitarist. Look, I get that you think the chord is "wrong" because the guitarist played a C, and you play guitar and you only care to play a C. But, guess what? That truly is a CMaj7 when the singer is holding that B note. Capo can't look past that fact, and maybe in your cover of the song you'll want to play the CMaj7 to add more flavour to your performance. Make it your own!
There are also other interesting issues I have noticed in certain recordings that have performed poorly with Chord Intelligence. Not all the instruments (or the singer) are in tune! When you have one instrument (or an ill singer?) that is pulling a part of the recording away from its true pitch, then when things are folded down for Capo's analysis there's not much we can do about that.
I actually have a stage in the processing that will align everything to a standard pitch reference within the recording to aid in the detection process; however, it cannot cope when only part of the song goes out of tune.
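To show why a single pitch reference cannot rescue a recording that is only partly out of tune, here is a small Python sketch of one possible way to estimate a global tuning offset. It assumes you already have a list of spectral peak frequencies and measures how far each sits from the nearest equal-tempered note relative to A = 440 Hz. The approach, `cents_off`, and `estimate_tuning` are assumptions for illustration, not a description of Capo's alignment stage.

```python
import math

def cents_off(freq_hz, ref_hz=440.0):
    """Deviation of freq_hz from the nearest equal-tempered note, in cents."""
    semitones = 12 * math.log2(freq_hz / ref_hz)
    return 100 * (semitones - round(semitones))

def estimate_tuning(peaks_hz, ref_hz=440.0):
    """Circular mean of the per-peak deviations, in cents."""
    # Map deviations onto a circle so +49 and -49 cents count as neighbours.
    angles = [2 * math.pi * cents_off(f, ref_hz) / 100 for f in peaks_hz]
    x = sum(math.cos(a) for a in angles)
    y = sum(math.sin(a) for a in angles)
    return 100 * math.atan2(y, x) / (2 * math.pi)

# C4, E4, G4 and A4 in equal temperament.
in_tune = [440.0 * 2 ** (n / 12) for n in (-9, -5, -2, 0)]

def detune(freqs, cents):
    """Shift every frequency by the given number of cents."""
    return [f * 2 ** (cents / 1200) for f in freqs]

whole_song_sharp = detune(in_tune, 20)
mixed = detune(in_tune, 20) + detune(in_tune, -20)

print(round(estimate_tuning(whole_song_sharp), 3))
print(round(estimate_tuning(mixed), 3))
```

When everything is sharp by the same amount, the estimate recovers that offset and the whole recording can be shifted back. When half the peaks are sharp and half are flat, the estimate lands in the middle, and a single correction leaves both groups just as far off as before.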
Anyway, I love that there are these challenges out there and they make for interesting conversations with other musicians and developers. I've already had many of them with customers over e-mail, and it's fun when I can see a user transition from blind rage about Capo "not working" into a deep reflection about the music theory. "Oh, yeah, I can see now why it thought that was an Fm7. The bassist is doing this, while the singer is…"
Guess what, I made you use your ears! ;)
There are legitimate failures, though. Unfortunately a great deal of the ones I've encountered can be explained by my aggressive filtering scheme. Let's take Peg by Steely Dan, for instance.
In some bars, Capo sees this as a simple C -> G progression. In other spots, Capo will carry the C for a few bars, then the G for a few bars. If we look at a bar with the C -> G, and double-click the C, there is a CMaj7 in the "short list" of detection results. Click it, and it sounds much closer to the recording. Hmm.
The bassist is following with the C and G root notes, but there are some passing tones in between. For each bar, the CMaj7 chord has the C note played by the bass for two beats, then the next two beats the bassist throws down a G, then a B and a D. That's moving fairly quickly, and the passing tones certainly don't help Chord Intelligence do its thing. It's simply missing the quick changes, and there is enough other stuff going on to cause Capo to stay on the C or G.
According to a tab that I found online, the main verse is supposedly a repeating CMaj7 (CEGB) to Gsus2 (GAD) on the guitar.
Double-click the G, and there's no Gsus2 but there's a Gsus4 in the short list! Press the Gsus4 and it sounds fairly good in that context. Type in a Gsus2 and it sounds only OK. But let's go back to that G: it sounds even closer to the content of the actual recording! You can't really fault Capo for that one.
While Capo may have missed a lot of those C -> G transitions, you know they're there because Capo told you in a part of the verse and it sounded correct. You now know how to play the bulk of the verse after maybe 5 seconds of work, tops. Just sayin'. :)
Chord Intelligence shipped last Thursday in Capo 3.1 and Capo touch. The response has been very positive, especially from customers that have used the older versions and notice the big improvements.
Capo 3.1 was shipped as a completely free update for existing Capo 3 customers. I truly appreciate the support you've all shown during my big launch last October, and I'm so happy that I could give you the results of this R&D so quickly.
Capo touch is a whole new app and is normally $9.99. I put both products on sale for this launch and if you hurry, you can get both for less than the price of a regularly-priced copy of the desktop product.
I'm not done with Chord Intelligence yet. I built a large amount of tooling for this work because I knew that the area of chord detection is going to be ever-evolving. There is a lot of research that I have yet to integrate, and devices will continue to improve in their capabilities.
I also hope to expand my training data, and I will be recruiting some help from musicians in the near future to help me build my training data set. Stay tuned for some further information on that front.