Capo 2's Innovation

I recently wrote a lot about why Capo doesn't do automatic transcription, but there was still more to the story.

Obviously, I shipped Capo 2, and it is a huge success! But, why? Where is the innovation?

Keeping Records

During my research on automatic transcription, I was using Capo regularly to figure out new music. I can't listen to music for very long without finding something new that I'd like to learn using Capo—this is one of the reasons I don't listen to music while I'm working. :)

When I first started using Capo, I would often learn passages without writing them down, and forget them shortly thereafter. Some solos would stick, but if I didn't practice them long enough, they'd fade away.

I then tried to keep a pad of transcription paper nearby, and a pencil that I could use to write down the licks I was learning. The paper piled up, and it was tough to keep organized.

For instance, I'd learn licks in the middle of a song, which meant that I had to annotate the papers with timestamps. That way I could remember where these licks appeared in the original recordings. I'd save some Capo documents with markers at the points I was working on, but the system was very disjointed.

During this period, I decided:

Capo 2 must be better than a pencil and paper.

I would not release anything relating to transcription unless it was easier to use than a pad of paper and a pencil on my desk. I thought that automatic transcription would achieve this, but I was wrong.

The Revelation

Even the best automatic transcription approach is going to result in false positives. A tom hit on the drums could be mistakenly identified as a note, a vocal could be merged with a guitar line, a note held for a long time could be accidentally split into multiple repeating notes, and so on.

In the end, the user must filter the transcription data so that they get the "truth notes", and eliminate the bad ones. But the nature of Capo is to serve users who don't even know what the right notes are yet—they're trying to figure that out!

This is when I realized something very important:

Users should not have to spend more time filtering bad data than entering good data.

Beautiful Sound

When I was experimenting with automatic transcription, I worked with many different spectral representations of audio data. In order to see that the algorithms were working, my test application would dump a TIFF image file that contained the time-frequency data.

Some of the representations resulted in very coarse data streams, with muddy low-frequency data displays. These were primarily the FFT-based approaches, which were fast but not very accurate.

However, when I got into the wavelet category (including the Constant-Q transform), the output was stunning. For a particular guitar solo that I was using as an example, every single note was very clearly visible.
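To make the "muddy low end" concrete, here is a rough Python sketch of the arithmetic involved. It is not Capo's code: the 44.1 kHz sample rate, the 2048-point FFT size, and all function names are assumptions chosen for illustration. It compares the fixed spacing of FFT bins with the gap between neighboring semitones near the guitar's low E, and shows how a constant-Q analysis spaces its bins geometrically instead.

```python
# Illustrative sketch only; the sample rate and FFT size are assumed values.
SAMPLE_RATE = 44100   # Hz (assumption)
FFT_SIZE = 2048       # samples per analysis frame (assumption)
LOW_E = 82.41         # Hz, low E string on a guitar in standard tuning


def fft_bin_spacing(sample_rate, fft_size):
    """FFT bins are linearly spaced: every bin is sample_rate / fft_size wide."""
    return sample_rate / fft_size


def semitone_gap(freq):
    """Distance in Hz from `freq` to the note one semitone above it."""
    return freq * (2 ** (1 / 12) - 1)


def constant_q_centers(f_min, bins_per_octave, n_bins):
    """Constant-Q bins are geometrically spaced, like musical pitches."""
    return [f_min * 2 ** (k / bins_per_octave) for k in range(n_bins)]


if __name__ == "__main__":
    print(f"FFT bin spacing:      {fft_bin_spacing(SAMPLE_RATE, FFT_SIZE):.2f} Hz")
    print(f"Semitone gap at low E: {semitone_gap(LOW_E):.2f} Hz")
    for center in constant_q_centers(LOW_E, 12, 13):
        print(f"constant-Q bin center: {center:.2f} Hz")
```

With these assumed settings, one FFT bin is several times wider than the distance between low E and the F above it, so neighboring low notes land in the same bin. With 12 bins per octave, the constant-Q centers fall one per semitone across the whole range.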

At this point I realized something. When I'm looking at this representation, it's very clear what's going on. I don't know exactly what notes are being shown, but I bet that if I showed this representation alongside the audio being played, the user would be able to identify slides, vibrato, and other performance details.

For a good while, that's all Capo did—it displayed the spectral representation with a grid to help you figure out what notes you were looking at.
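A note grid like this comes down to mapping frequencies to note names. The sketch below uses the standard equal-temperament relationship between frequency and MIDI note number, with A4 = 440 Hz as an assumed reference pitch. The function names are hypothetical and it is not taken from Capo.

```python
import math

# Sharps-only spelling, chosen for simplicity in this sketch.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def freq_to_midi(freq, a4=440.0):
    """Nearest MIDI note number for a frequency (equal temperament, A4 = 69)."""
    return round(69 + 12 * math.log2(freq / a4))


def midi_to_name(midi):
    """Note name with octave, using the convention that MIDI 60 is C4."""
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


if __name__ == "__main__":
    for freq in (82.41, 110.0, 440.0):
        print(freq, "Hz is closest to", midi_to_name(freq_to_midi(freq)))
```

Drawing one grid line per MIDI note number and labeling it with `midi_to_name` would give the kind of reference described here.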

Putting It All Together

When I was looking at this spectrum of data, and the grid that sat beneath it, it hit me. Why don't I just build the same quick-entry piano roll UI that users have been familiar with for almost 20 years!? (Well, at least I was familiar with it.)

So, Capo's UI evolved to one that promoted quick entry of the "truth notes," with the spectrogram providing a helping hand for identifying note onsets and locations. Dumping tablature based on the entered piano roll display would simply make the notes more readable.
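As a rough illustration of how a piano-roll note could become a tablature position, here is a small Python sketch. It assumes standard guitar tuning, a 22-fret neck, and a naive "lowest fret wins" rule. All of these are assumptions made for the example, and none of it is a description of how Capo actually chooses positions.

```python
# Open-string MIDI pitches in standard tuning; string 6 is the low E.
STANDARD_TUNING = {6: 40, 5: 45, 4: 50, 3: 55, 2: 59, 1: 64}


def note_to_tab(midi, max_fret=22):
    """Return (string, fret) for a MIDI note, or None if it is unplayable.

    Naive rule (an assumption for this sketch): among all strings that can
    play the note, pick the one that needs the lowest fret.
    """
    candidates = [
        (midi - open_pitch, string)
        for string, open_pitch in STANDARD_TUNING.items()
        if 0 <= midi - open_pitch <= max_fret
    ]
    if not candidates:
        return None
    fret, string = min(candidates)
    return string, fret


if __name__ == "__main__":
    for midi in (40, 57, 64, 69):
        print(midi, "->", note_to_tab(midi))
```

A real tab generator would also have to weigh hand position and the notes around each one. This only shows that each entered pitch already carries enough information to place a fret number on a string.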

Because our ears are still our most important guide for accurate transcription, I decided it would be wise to play a sound¹ when notes were being entered (and later clicked). That way, we can quickly verify that what we are drawing is what we are hearing in the original recording.

By this point I had built the fastest possible way to enter notes while learning a new lick or passage. I had effectively kicked paper's ass.

The Innovation

The most innovative feature in Capo is its user experience.

A user can get music into Capo in a number of simple ways. The simplest, of course, is dragging a song from iTunes onto Capo's dock icon.

Once the music is loaded, the user can listen to it slowed down, so that individual notes are more clearly heard. In version 2, the notes are now also visible, so the user gets further validation of what they are hearing.

Finally, the user is able to simply click & drag on top of the spectrogram to generate tablature of the notes they're hearing. The tablature is saved in the Capo document, and you can easily recall that document later to continue where you left off.

The turnaround time from first hearing a lick to tabbing it out and playing it is now incredibly short. Sometimes I need to sit back, look at what I've done, and realize that learning to play my favorite music will never be the same.

¹ I chose to use a piano sound because it is the widest-range instrument, sounding good no matter what note is played. In contrast, if I had used a guitar, it would sound tinny and weird in the very high notes, and flabby for the lower bass notes.