A Note on "Auto Tabbing"

I've read a few articles about Capo 2 lately, and noticed that some authors claim that Capo will "automatically tab out" music. Here's what Capo's product page says:

Tab it Out!

By simply drawing atop the spectrogram, Capo will generate tablature automatically for you. It really doesn't get much easier than this!

The "automatic" bit is related to the process of translating your entered note data (what I like to refer to as "truth notes") into tablature below the spectrogram display. In the future, I plan on adding support for standard notation, though it's a much tougher problem to solve.
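To make the "automatic" part concrete, here is a minimal sketch of turning an already-known note into a tab position. It is not Capo's actual code: the function name note_to_tab, the use of MIDI note numbers, standard tuning, the 20-fret limit, and the "lowest fret wins" rule are all assumptions made for illustration.

```python
# Open-string pitches in standard tuning as MIDI note numbers,
# ordered from the low E (string 6) to the high E (string 1).
STANDARD_TUNING = [40, 45, 50, 55, 59, 64]


def note_to_tab(midi_note, tuning=STANDARD_TUNING, max_fret=20):
    """Return (string_number, fret) using the lowest playable fret, or None."""
    best = None
    for index, open_pitch in enumerate(tuning):
        fret = midi_note - open_pitch
        if 0 <= fret <= max_fret:
            string_number = len(tuning) - index  # 6 = low E, 1 = high E
            if best is None or fret < best[1]:
                best = (string_number, fret)
    return best


# A3 (MIDI 57) is reachable on several strings; the lowest fret is chosen.
position = note_to_tab(57)
```

The hard part of transcription is not this step. Once the notes are known, placing them on a fretboard is simple bookkeeping; discovering the notes from audio is the unsolved problem.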

To me, the term "Auto Tabbing" is the same as "automatic transcription." There are many researchers who are working to advance this technology, but it's far from ready. I know, because I researched it for much of the past year.

The Research

I started researching different methods of automatic transcription in mid-2009, because I was curious about how far along this technology was, and whether it could be integrated into a future version of Capo.

Each of these automatic transcription algorithms starts out with some kind of intermediate representation of the audio data, and then translates that into a symbolic form (e.g. note onsets and durations).
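As a rough illustration of what "symbolic form" means, the sketch below assumes the hard part is already done: some earlier stage has produced one pitch estimate per analysis frame (or None for silence). The function name frames_to_notes and the frame-based input are hypothetical and not taken from any particular paper.

```python
def frames_to_notes(frame_pitches, hop_seconds):
    """Group runs of identical per-frame pitches into (pitch, onset, duration)."""
    notes = []
    current = None  # (pitch, start_frame) of the note being held, if any
    # A trailing None acts as a sentinel so the final note gets closed.
    for i, pitch in enumerate(list(frame_pitches) + [None]):
        if current is not None and pitch != current[0]:
            held_pitch, start = current
            notes.append((held_pitch, start * hop_seconds, (i - start) * hop_seconds))
            current = None
        if current is None and pitch is not None:
            current = (pitch, i)
    return notes


# Six frames, half a second apart: silence, C4 twice, D4, silence, D4 again.
events = frames_to_notes([None, 60, 60, 62, None, 62], hop_seconds=0.5)
```

Real recordings never give frame estimates this clean, which is where most of the difficulty lies.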

This is where I encountered some computationally expensive spectral representations (the Continuous Wavelet Transform (CWT), the Constant-Q Transform (CQT), and others). I implemented all of these spectral transforms so that I could also implement the algorithms presented in the papers I was reading. This would give me an idea of whether they would work in practice.

Boy, was I in for a surprise: just implementing the front-end to many of these transcription algorithms had me feeling defeated. In one paper, the authors claimed to have computed a CWT on a 30-second audio sample in only 1.9 seconds, where my own implementation was taking upwards of 15 minutes (on an 8-core Mac Pro!).

Sure enough, contacting the researchers revealed that they were using a modified version of the CWT (contrary to what the paper said), which they are keeping as a closely guarded secret. So that was the end of that…

I then (re-)stumbled on the Constant-Q Transform (which I had first encountered in FuzzMeasure research back in 2004 or 2005). This is considered by some to be a special case of a wavelet transform. My first implementation was promising (only about half the time of my CWT implementation, and a tiny fraction of the RAM usage). Then, I ran with that and made it better.
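For readers who haven't met the Constant-Q Transform, here is a minimal sketch of the direct, textbook form for a single frame. It is not the implementation described here and makes no attempt at speed; the function name constant_q_frame, the Hamming window, and the parameter names are assumptions for illustration. The key idea is that bins are spaced geometrically (a fixed number per octave) and each bin uses a window long enough to hold the same number of cycles, Q, of its own centre frequency.

```python
import cmath
import math


def constant_q_frame(samples, sample_rate, f_min, bins_per_octave, n_bins):
    """Naive constant-Q magnitudes for one frame that starts at samples[0]."""
    # Q is the ratio of centre frequency to bandwidth, identical for every bin.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    magnitudes = []
    for k in range(n_bins):
        f_k = f_min * 2.0 ** (k / bins_per_octave)
        n_k = math.ceil(q * sample_rate / f_k)  # window shrinks as frequency rises
        if n_k > len(samples):
            raise ValueError("frame too short for this bin's window")
        total = 0j
        for n in range(n_k):
            window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_k - 1))  # Hamming
            total += window * samples[n] * cmath.exp(-2j * math.pi * q * n / n_k)
        magnitudes.append(abs(total) / n_k)
    return magnitudes


# A 440 Hz test tone analysed with semitone-spaced bins starting at 220 Hz.
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 440.0 * n / sample_rate) for n in range(1024)]
mags = constant_q_frame(tone, sample_rate, f_min=220.0, bins_per_octave=12, n_bins=24)
loudest_bin = max(range(len(mags)), key=mags.__getitem__)
```

With semitone spacing the bins line up with musical pitches, which is what makes this representation attractive for transcription. The cost is the long windows needed at low frequencies, and evaluating them directly for every frame, as above, gets expensive quickly.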

I grafted some transcription approaches on top of this spectral representation, and realized very quickly that these algorithms are not ready for prime time.

Even the best automatic transcription algorithms today only work with a single instrument voice (e.g. just a violin, or just a flute). Some can go further and transcribe multiple voices of the same instrument (e.g. 3 cellos or 2 flutes), but their accuracy drops considerably. The best accuracy that I encountered was in the 60-70% range.

The Road-Block

I think that the major problem affecting automatic transcription right now is filtering and separation. Because the single-voice algorithms are progressing steadily, it would seem that one simply has to separate the individual instruments into different streams, and then apply the algorithms to each stream.

Unfortunately, you can't unbake a cake. The stereo recordings we listen to are mixed down from many tracks, and processed heavily, in order to get the final result.

My opinion is that we're stuck at a road block, and we'll only get past it when music is distributed in multitrack form, with mixing and processing done by the listener.

In short: Don't hold your breath.