06 October '09

Drawing Waveforms

I get asked about drawing waveforms from time to time. Over the years, I came to realize that this is a black art of sorts, and it requires a combination of some audio and drawing know-how on the Mac to get it right.

But first, a little story.

Once upon a time I used to write audio software for BeOS while I was in university. As almost every audio software author eventually does, I came to a point where I needed to render audio waveforms to the screen. I hacked up a straightforward drawing algorithm, and it worked well.

When I started working on a follow-on project, I decided to re-use the algorithm I wrote for the first application, but it didn’t work so well. The trouble is, when I originally wrote that algorithm, the audio clips in question were all very tiny—less than 2s. Now I was dealing with much longer clips (up to a few minutes, in practice), and the algorithm didn’t scale well at all.

Around this time, I interviewed with Sonic Foundry, with the hopes of joining the Vegas team. During my interview, I asked, “How do you guys draw waveforms on-screen for large audio clips, and so quickly!?”

“That’s proprietary information, sorry.”

At the time, I figured the guys were just avoiding a long, drawn-out response. I had coded this up myself, after all, even if my version wasn’t so fast, so it can’t be that difficult, right? Unfortunately, I got similar responses from other people I asked afterwards.

Regardless of whether you’re new to audio, or you’ve been doing it for a while, you are aware that there aren’t too many books on the topic. Furthermore, you probably aren’t going to find too much in the way of detailed algorithms, or even pseudocode, to help you out.

I’m starting to realize that the reason is two-fold.

First off, there really aren’t a lot of people out there who need to draw audio waveforms (or large data sets, for that matter) to screen. Second, it’s really not all that hard once you think about it for a while.

Overview

Drawing waveforms boils down to a few major stages: acquisition, reduction, storage, and drawing.

For each of the stages, you have many implementation options, and you’ll choose the simplest one that’ll serve your application. I don’t know what your application is, so I’ll use Capo as the main example for this post, and throw around some hypothetical situations where necessary.

Early on, you have to set some priorities: Speed, Accuracy, and Display Quality. The order of those priorities will help you decide how to build your drawing algorithm, down to the individual stages.

In Capo, I wanted to make Display Quality the top priority, followed by Speed, and then Accuracy. Because Capo would never be used to do sample-precise edits, I could throw away a whole lot of data, and then make the waveform look as good as possible in a short time frame.

If I were writing an audio editor, my priorities might be Accuracy, followed by Speed, and then Display Quality. For a sequencer (like GarageBand), I’d choose Speed, Display Quality, then Accuracy, because you’re only viewing the audio at a high level, and it’s part of a larger group of parts. Make sense?

Once you have an idea of what you need, you will have a clear picture of how to proceed.

Acquisition

This is almost worth a post of its own. I like using the ExtAudioFile{Open,Seek,Read,Close} API set from AudioToolbox.framework to open various audio file formats, but you may choose a combo of AudioFile+AudioConverter (ExtAudioFile wraps these for you), or QuickTime’s APIs, or whatever else floats your boat.

Your choice of API to get the source data is entirely up to your application. You can’t extract movie audio with (Ext)AudioFile APIs, for instance, so they might not help much when writing a video editing UI. Alternatively, you may have your own proprietary format, or record short samples into memory, etc.

Given the above, I’m going to assume you’re working with a list of floating-point values representing the audio, because that’ll be helpful later on. Using ExtAudioFile, or an AudioConverter, make sure that your client (output) format is set for floats, and you should be good.

When you’re pulling data from a file, keep in mind that it’s not going to be very quick, even on an SSD, thanks to format conversions. I’d advise doing all this work in an auxiliary thread, no matter how you get your audio, because it’ll keep your application responsive.

In Capo’s case, there is a separate thread that walks the entire audio file, doing the acquisition, reduction, and storage steps all at once. Because Display Quality and Speed were high on the priority list, the drawing step is done only when needed.
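The single-pass idea can be sketched in plain C. This is a minimal sketch, not Capo's actual code: `OverviewBuilder` and its functions are hypothetical names, it assumes mono float samples arriving in arbitrarily sized chunks (for example, from repeated reads on a background thread), and it uses the maximum-magnitude reduction described in the next section.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical streaming reducer: feed it chunks of samples as they are
   read; it stores one max-magnitude value per `bin_size` input samples. */
typedef struct {
    float  *out;          /* caller-provided storage for the overview */
    size_t  out_capacity; /* number of floats `out` can hold */
    size_t  out_count;    /* overview values written so far */
    size_t  bin_size;     /* input samples per overview value */
    size_t  bin_fill;     /* samples seen in the current, unfinished bin */
    float   bin_peak;     /* largest magnitude seen in the current bin */
} OverviewBuilder;

void overview_init(OverviewBuilder *b, float *out, size_t out_capacity, size_t bin_size)
{
    assert(bin_size > 0);
    b->out = out;
    b->out_capacity = out_capacity;
    b->out_count = 0;
    b->bin_size = bin_size;
    b->bin_fill = 0;
    b->bin_peak = 0.0f;
}

/* Consume one chunk. A bin may straddle two chunks, so the partial
   bin state is carried over between calls. */
void overview_feed(OverviewBuilder *b, const float *samples, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        float mag = samples[i] < 0.0f ? -samples[i] : samples[i];
        if (mag > b->bin_peak) b->bin_peak = mag;
        if (++b->bin_fill == b->bin_size) {
            if (b->out_count < b->out_capacity) b->out[b->out_count++] = b->bin_peak;
            b->bin_fill = 0;
            b->bin_peak = 0.0f;
        }
    }
}

/* Call once at end of file to flush a final, partially filled bin. */
void overview_finish(OverviewBuilder *b)
{
    if (b->bin_fill > 0 && b->out_count < b->out_capacity) {
        b->out[b->out_count++] = b->bin_peak;
    }
    b->bin_fill = 0;
    b->bin_peak = 0.0f;
}
```

Because the builder keeps its partial-bin state between calls, the read buffer size does not need to be a multiple of the bin size.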

Reduction

Audio contains tons of delicious data. Unfortunately, when accuracy isn’t the top priority, it’s far too much data to be shown on the screen. With 44,100 samples/second, a second of audio would span ~17 30″ Cinema Displays (2,560 pixels wide each) if you displayed one sample value per horizontal pixel.

If accuracy is your top priority, you’re still going to be throwing lots of data away most of the time, except when your user wants to maintain a 1:1 sample:pixel ratio (or, in some cases, I’ve seen a sample take up more than 1 pixel, for very fine editing). If you’re writing an editor, or some other application that needs high-detail access to the source data, you will have to re-run the reduction step as the user changes the zoom level. When the user wants to see 1:1 samples:pixels, you won’t throw anything away. When the user wishes to see 200:1 samples:pixels, you’ll throw away 199 samples for every pixel you’re displaying.

In the case of Capo, I chose to create an overview data set for the ‘maximum zoom’ level, at a resolution of 50 samples per pixel, and keep that on the heap (a 5 minute song should take ~1MB RAM). As the user zooms out, I then sample the overview data set to get the lower-resolution versions of the data. Accuracy isn’t great, but it’s pretty fast.
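As a rough check of those numbers, and one way a coarser zoom level might be derived from a stored overview, here is a sketch in plain C. The function names are hypothetical, and it assumes mono audio at 44,100 samples/second stored as 4-byte floats. It is not necessarily how Capo samples its overview.

```c
#include <assert.h>
#include <stddef.h>

/* Number of overview values needed for `sample_count` samples when each
   value summarises `samples_per_value` samples (rounded up). */
size_t overview_value_count(size_t sample_count, size_t samples_per_value)
{
    assert(samples_per_value > 0);
    return (sample_count + samples_per_value - 1) / samples_per_value;
}

/* Derive a coarser overview by keeping the largest value in each group of
   `group` stored values. `dst` must have room for
   overview_value_count(src_count, group) floats.
   Returns the number of values written. */
size_t zoom_out_overview(const float *src, size_t src_count, size_t group, float *dst)
{
    assert(group > 0);
    size_t written = 0;
    for (size_t start = 0; start < src_count; start += group) {
        size_t end = start + group < src_count ? start + group : src_count;
        float peak = src[start];
        for (size_t i = start + 1; i < end; i++) {
            if (src[i] > peak) peak = src[i];
        }
        dst[written++] = peak;
    }
    return written;
}
```

Five minutes at 44,100 samples/second is 13,230,000 samples. At 50 samples per value that is 264,600 floats, or 1,058,400 bytes, which matches the ~1MB estimate. Going from 50 to 200 samples per pixel corresponds to `group = 4`.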

Now, when I talk about “throwing away”, or “sampling” the data set, I don’t necessarily mean discarding data blindly. In some cases, randomly choosing samples to include in the final output will work just fine. However, you may encounter some pretty annoying artifacts (missing transients, jumping peaks, etc.) when you change zoom levels or resize the display. If Display Quality is low on your list, who cares?

If you do care, you have a few options. Within each “bin” of the original audio, you can take a min/max pair, just the maximum magnitude, or an average. I have found the maximum magnitude to work well for the majority of cases. Here’s an example of what I do in Capo (in pseudocode, of sorts):

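// source_audio contains the raw sample data (sample_count floats)
// overview_waveform will be filled with the 'sampled' waveform data
// N is the 'bin size' determined by the current zoom level
for ( i = 0; i < sample_count; i += N ) {
    // The last bin may hold fewer than N samples
    n = ( sample_count - i < N ) ? ( sample_count - i ) : N;
    // take_max_value_of returns the largest magnitude (absolute value) in the bin
    overview_waveform[i/N] = take_max_value_of( &(source_audio[i]), n );
}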
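For comparison, the three per-bin options mentioned above (min/max pair, maximum magnitude, average) can all be computed in one pass over a bin. This is an illustrative sketch in plain C. `BinSummary` and `summarize_bin` are made-up names, not anything from Capo.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float min;   /* smallest sample value in the bin */
    float max;   /* largest sample value in the bin */
    float peak;  /* maximum magnitude: the larger of |min| and |max| */
    float mean;  /* arithmetic average of the bin */
} BinSummary;

BinSummary summarize_bin(const float *samples, size_t count)
{
    assert(count > 0);
    BinSummary s = { samples[0], samples[0], 0.0f, 0.0f };
    double sum = 0.0;
    for (size_t i = 0; i < count; i++) {
        if (samples[i] < s.min) s.min = samples[i];
        if (samples[i] > s.max) s.max = samples[i];
        sum += samples[i];
    }
    /* The largest absolute value in a set is always at its min or its max. */
    float lo = s.min < 0.0f ? -s.min : s.min;
    float hi = s.max < 0.0f ? -s.max : s.max;
    s.peak = lo > hi ? lo : hi;
    s.mean = (float)(sum / (double)count);
    return s;
}
```

Note that the plain average of a signal centred on zero tends toward zero, so for display purposes the peak or the min/max pair is usually the more informative value.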

Once you have your reduced data set, then you can put it on the screen.

Display

Here's where you have the most leeway in your implementation. I use the Quartz API to do my drawing. I prefer the family of C CoreGraphics CG* calls, because they're portable to CoreAnimation/iPhone coding, the most feature-rich, and generally quicker than their Cocoa equivalents. I won't get into any alternatives here (e.g. OpenGL), to keep it simple.

If we stick with the Capo example, then we've chosen to use the maximum magnitude data to draw our waveform. By doing so, we can exploit the fact that the waveform is going to be symmetric about the X axis, and only create one half of the final waveform path using some CGAffineTransform magic.

In the past, developers would create waveforms in pixel buffers using a series of vertical lines to represent the magnitudes of the samples. I like to call this the "traditional waveform drawing". It's still used quite a bit today, and in some cases it works great (especially when showing very small waveforms, and pixels are scarce like in a multitrack audio editor).

[Figure: Traditional Waveform]

I personally prefer to utilize Quartz paths so that I get some nice anti-aliasing to the waveform edge. Because Capo features the waveform so prominently in the display, I wanted to ensure I got top-notch output. Quartz paths gave me that guarantee.

To build the half-path, we'll also be exploiting the fact that both CoreAudio and Quartz represent points using floating-point values. Sadly, this code is slightly less awesome in 64-bit mode, since CGFloats become doubles, and you have to convert the single-precision audio floats over to double-precision pixels. Luckily there are quick routines for that conversion in Accelerate.framework (A whole 'nother blog post, I know...).

- (CGPathRef)giveMeAPath
{
    // Assume mAudioPoints is a CGPoint* holding your audio points
    // (as {sampleIndex,value} pairs), and mAudioPointCount
    // contains the # of points in the buffer.
    // The caller owns the returned path and must CGPathRelease() it.

    CGMutablePathRef path = CGPathCreateMutable();
    CGPathAddLines( path, NULL, mAudioPoints, mAudioPointCount ); // magic!
    return path;
}
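For the 64-bit case, the point buffer has to be filled with doubles. Here is a portable sketch of that step. `Point2D` is a stand-in with the same two-field layout as a 64-bit CGPoint, and `fill_points` is a hypothetical helper. On the Mac, I believe the Accelerate routine for the float-to-double copy is vDSP_vspdp. This version uses a plain loop so that it has no framework dependency.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for CGPoint in 64-bit mode (two doubles: x, then y). */
typedef struct { double x; double y; } Point2D;

/* Turn `count` single-precision magnitudes into {x, value} points.
   `x_step` is the horizontal distance between successive points. */
void fill_points(const float *magnitudes, size_t count, double x_step, Point2D *points)
{
    assert(x_step > 0.0);
    for (size_t i = 0; i < count; i++) {
        points[i].x = (double)i * x_step;
        points[i].y = (double)magnitudes[i]; /* widening float to double loses nothing */
    }
}
```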

Because magnitudes are represented in the range [0,1], and we're using Quartz, we can build a transform that'll scale the waveform path to fit inside half the height of the view, and then append another transform that'll translate/scale the path so it's flipped upside-down, and appears below the X axis line (which corresponds to a sample value of 0.0). Here's a zoomed in example of what I'm talking about.

[Figure: Flipped Waveform]
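The arithmetic behind those two transforms can be written out for a single point in plain C. With a translate followed by a scale built up the Quartz way, the point is scaled first and translated second, so a magnitude m ends up at halfHeight + m * halfHeight for the first copy and at halfHeight - m * halfHeight for the flipped copy. The function names below are illustrative only.

```c
#include <assert.h>

/* Scale by half_height, then translate by half_height:
   a magnitude in [0,1] lands in [half_height, 2 * half_height]. */
double map_first_half(double magnitude, double half_height)
{
    assert(half_height >= 0.0);
    return half_height + magnitude * half_height;
}

/* Scale by -half_height, then translate by half_height:
   a magnitude in [0,1] lands in [0, half_height], mirrored about the axis. */
double map_flipped_half(double magnitude, double half_height)
{
    assert(half_height >= 0.0);
    return half_height - magnitude * half_height;
}
```

In both cases a magnitude of 0.0 lands exactly on the axis line at half_height, which is why the two halves join without a seam.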

And here's some code to give you an idea of what's going on to create the whole path:

// Get the overview waveform data (taking into account the level of detail to
// create the reduced data set)
CGPathRef halfPath = [waveform giveMeAPath];

// Build the destination path
CGMutablePathRef path = CGPathCreateMutable();

// Transform to fit the waveform ([0,1] range) into the vertical space 
// ([halfHeight,height] range)
double halfHeight = floor( NSHeight( self.bounds ) / 2.0 );
CGAffineTransform xf = CGAffineTransformIdentity;
xf = CGAffineTransformTranslate( xf, 0.0, halfHeight );
xf = CGAffineTransformScale( xf, 1.0, halfHeight );

// Add the transformed path to the destination path
CGPathAddPath( path, &xf, halfPath );

// Transform to fit the waveform ([0,1] range) into the vertical space
// ([0,halfHeight] range), flipping the Y axis
xf = CGAffineTransformIdentity;
xf = CGAffineTransformTranslate( xf, 0.0, halfHeight );
xf = CGAffineTransformScale( xf, 1.0, -halfHeight );

// Add the transformed path to the destination path
CGPathAddPath( path, &xf, halfPath );

CGPathRelease( halfPath ); // clean up!

// Now, path contains the full waveform path.

Once you have this path, you have a bunch of options for drawing it. For instance, you could fill the path with a solid color, turn the path into a mask and draw a gradient (that's how Capo does it), etc.

Keep in mind, though, that a complex path with lots of points can be slow to draw. Be certain that you don't include more data points in your path than there are horizontal pixels on the screen—they won't be visible, anyway. If necessary, draw in a separate thread to an image, or use CoreAnimation to ensure your drawing happens asynchronously.

Use Shark/Instruments to help you decide whether this needs to be done. It's complicated work, and tough code to get working correctly with very few drawing artifacts. You don't even want to know the crazy code I had to get working in TapeDeck to have chunks of the waveform paged onto the screen. (Well, you might, but that's proprietary information, sorry. ;))
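The advice about keeping the point count at or below the number of horizontal pixels comes down to a ceiling division. This is a small sketch with a hypothetical function name. It returns how many stored points should be merged into each drawn point so the path never has more points than pixels.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest group size such that ceil(point_count / group) <= pixel_width. */
size_t points_per_pixel(size_t point_count, size_t pixel_width)
{
    assert(pixel_width > 0);
    size_t group = (point_count + pixel_width - 1) / pixel_width;
    return group > 0 ? group : 1; /* never merge fewer than one point */
}
```

For example, a 264,600-point overview drawn into a 1,280-pixel-wide view gives a group size of 207, while a 1,000-point data set in the same view needs no merging at all.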

In Conclusion

People have suggested to me in the past that Apple should step up and hand us an API that would give waveform-drawing facilities (and graphs, too!). I disagree, and if Apple were to ever do this, I'd probably never use it. There are simply far too many application-specific design decisions that go into creating a waveform display engine, and whatever Apple would offer would probably only cover a small handful of use cases.

Hopefully the above information can help you build a waveform algorithm that suits your application well. I think that by breaking the problem up into separate sub-problems, you can build a solution that'll work best for your needs.

28 Comments

alastair

Over a year ago

An even better option than minimaxing if you want really high quality results would be to bin the sample values in each pixel and then shade the pixels according to the frequency with which the value was above (or below) that point.

Of course, drawing this is then going to require a somewhat different approach.


chris

Over a year ago

@alastair: Got an example of what that looks like? My gut says that’d result in all kinds of display artifacts…


Uli Kusterer

Over a year ago

Hi,

great article! Thanks for sharing this knowledge with the world. While most of this is information that can be found after searching, it is great to have the positive reinforcement that this is the general way to go about it. I was afraid I’d have to deal with sample rates and other stuff, but it’s really just a bunch of floats that I can grab from ExtAudioFile, great! :-)

As to Alastair’s approach, I think he’s essentially anti-aliasing. The areas that would have more lines get condensed into darker, while the areas that have fewer get lighter. That should work fine, if you like the fuzziness that anti-aliasing gives you.

Tempted to sit down and write my own audio file view now … dang!

Thanks for this article! – Uli


Andrew Kimpton

Over a year ago

Taking a straight magnitude and flipping it is quick (and halves the data storage) however it can cause the display to misrepresent any sort of DC bias that’s present in the audio – for many apps this is fine but if your scenario is more high end then this could be a concern. Equally a lot of people expect to see a mirror image waveform, and you’d be surprised how ‘uneven’ even well mastered audio can look !

Color coding according to spectral power (via an FFT) has been done in a couple of high end audio editors (IIRC Pyramix might do this, and I think Waves does it in one of their plugins). Unless you know how to read what you’re looking at it can be confusing for a user.


Volker

Over a year ago

Hi,

been thru this nearly a year ago. I have to deal with 500 kHz sampled data files as well, which makes the down-sampling process for proper drawing a pain in the ***. In the end I found a way that works well and is fast. I use the flipping approach until a certain zoom level is reached and switch to draw all samples, what the hell mode. This approach is necessary since my users need to analyse the sound files for contained bat calls and do need to access stuff like DC offset, which is producing the impression of noise in the flipped display. see: http://www.ecoobs.de/art/her-bcAnalyze-Screenshot.png


Volker

Over a year ago

In addition: I dismissed using the min/max approach and favor an approach based on the RMS value of a window of samples. basically, with some tricks like overlapping windows and such. gives more appropriate results, but then, i am dealing not with music.


chris

Over a year ago

@Andrew Good to hear from an Ex-Be reader… :)

You’re very right about the DC bias. I was going to give an example of a low-frequency 20Hz sine wave modulated with a high-frequency 1kHz sine wave, which would not render properly with the ‘maximum magnitude’ method I walked through.

Min/max (value, not magnitude) tends to catch the DC issue well enough.

You’re right about those fancy spectral displays being confusing for most users. I leave all the fancy visualizations up to FuzzMeasure. However, I might have some tricks up my sleeve for Capo soon enough… ;)


chris

Over a year ago

@Volker Yes, the hybrid approach is often called for beyond a certain threshold.

In FuzzMeasure, I deal with a logarithmic X axis, where the curve-drawing approach must change as the data becomes more dense for higher frequencies.

In the low-frequency range I can get real fancy, approximating the missing waveform data using a Bézier path, and in the high-frequency range I move to min/max. Seems to work real well.


Andrew Kimpton

Over a year ago

Hi Chris! Us Ex-Be folk are distributed fairly far and wide, though Dan Sandler & I are now only separated by about 10 floors, as we’ve both ended up in Boston (via very different routes).

A while after leaving Be I worked at BIAS for 4 years – so I’ve seen a lot of waveforms in my time.

When we looked at spectral coloring at BIAS I remember that we were concerned about one or more patents in the same area. I cannot now remember the details though – sorry 8-(


ChrisM

Over a year ago

I’m using max/min pairs in my app. The difficulty I’m experiencing at the moment, however, is how to get efficient interactive resizing. My waveform is scaled in the x and y directions, and since repainting on each resize can be expensive I cache an image and scale that. This is very fast, but unless the waveform is initially drawn at full size the waveform appears under-sampled when resized upwards.


chris

Over a year ago

@ChrisM: This is tough to deal with. I try to avoid scaling an image of the waveform curve as the anti-aliasing done to the original curve just doesn’t behave well when resized.

I recommend an intermediate representation that can be rendered to the screen quickly (in the form of a CGPath, or something like that). Perhaps while live-resizing (you can query an NSView to see if it’s mid-resize) you can avoid resampling the data until the resize is done—I’ve done something similar to this in an earlier FuzzMeasure version.


Fred

Over a year ago

Does anyone know of a good code example that shows how to open an audio file (aif, wave, mp3, and aac) and then convert it to a buffer of Float32 as suggested in this article? If the incoming file is stereo, I’d like two buffers and if mono, just one.

So far, I’m having terrible luck using AudioConverterFillComplexBuffer. I’ve followed numerous examples from the web and AudioConverterFillComplexBuffer keeps returning osErr = -50. The callback seems to get called once, but never again (because of the error). Or, if I twiddle the channel count manually, I can get osErr=0, but the buffer seems to be empty and the callback gets called only once. Any ideas where I can go for help? Thanks!


chris

Over a year ago

@Fred: I don’t know of any offhand. If you’re struggling, I recommend you first try using the ExtAudioFile API, and then later move to a separate AudioFile / AudioConverter approach (which I’ve yet to find necessary, FWIW).

With ExtAudioFile, the process boils down to opening the file, specifying your desired client format (Float32, deinterleaved in your case for two separate buffers), and then calling ExtAudioFileRead repeatedly to get your data.


Bastian

Over a year ago

Your post just solved a huge problem I was struggling with for the last few days. Thanks a lot!

(There really is not too much information out there on these topics)


John Myers

Over a year ago

Can you elaborate on the ExtAudioFile extraction process? Specifically, specifying your desired client format? I can get the data out but it seems to be random numbers which I suspect is because I am interpreting it wrong.


Jared

Over a year ago

Arrived via a pointer Volker dropped into Apple dev forums; so nice of each of you to take the time to help people understand this stuff.


GW Rodriguez

Over a year ago

This post is great! Got one question: above he mentioned that he kept the data set of the maximum zoom (50 samples per pixel) and later he says that he uses CGPoints. I am curious as to how to keep all that data in a nice neat place?

I am assuming it’s an array of CGPoints but how do you define such an array with a variable and unknown element size?

Thanks GW


tommy

Over a year ago

Hi, I’m not a sound guy nor am I aspiring to become one – and maybe that’s my problem.

I’m currently building an app in which one can play pre-recorded sound pieces or record new ones. I have also added a view in which one can trim the sound by pinching a neat overlay. Here comes the tricky part: an overlay over what? Of course I want it to be a fancy wave as in Audacity, or at least something that a user might feel represents their sound.

So, I really need to represent the sound as a wave and display it in this clipping window, but it seems that I simply cannot manage to get this together. I have built a graphing tool and everything needed but I just can’t understand how to get hold of the data (which you refer to as a series of floats in your post) that I want to represent. I’m totally amazed at how hard this is for me and lately I’ve started to think that I should just draw a picture of the mp3 instead of generating a wave of it :/

I hope that you can hear my desperation and are willing to help/bombard me with neat tutorials and examples of how to extract this float data, since I simply cannot do it by myself it seems :(

Sincerely, Tommy


Bonzo

Over a year ago

Thanks for this post. It really helped me to get my project in gear, while still giving me the feeling of accomplishing it myself. It’s hard work, but worth the time. A great overview of all steps of the implementation. Keep it up!


Camille Goudeseune

Over a year ago

This peer-reviewed article “Effective browsing of long audio recordings” generalizes min-max to min-mean-max and efficiently avoids undersampling. C++ source code is at https://github.com/camilleg/timeliner .


Camille Goudeseune

Over a year ago

(Whoops: didn’t realize the URL http://dl.acm.org/citation.cfm?id=2390831 would get hidden inside my name.)


Joey

Over a year ago

Great article. I’m writing some code to render audio from a wav file in Java. I wrote a small RIFF framework for parsing chunks and I’d like to use it to playback the PCM data as well as render it visually. Unfortunately, I’m finding the most expensive part of this is doing the little endian to big endian conversion of the samples, as most audio players will expect raw PCM data to be big endian.

I’m caught in between simply always reading the audio as little endian and converting to float so that it can be used easily with 2D drawing, or doing all the processing on a separate thread, which means that my UI components have to be a bit more savvy to do the work on a separate thread and use a spinning indicator before it can be visually rendered.

But if I do that, then my framework doesn’t also convert to big endian for playback. Any suggestions as to what would normally be done?


Karl Hiner

Over a year ago

Really nice post and discussion. There really is too little information about this problem.

I’m also facing an issue with live-zooming/scrolling and slow redraws, for a mobile app – I specify a minimum Samples-Per-Pixel of, say, 0.5, so I’m only collecting a max of (width-in-pixels/2) samples, but there is a read from disk for each one of these – can’t really cache an entire file for large files, and there is no way to get the periodic data in a single read.

I like your idea of keeping the current level of granularity until the user stops zooming/scrolling through the waveform, and only then resampling to reflect the new zoom level.

I’m still struggling with the tradeoff of accuracy, which is very important to me, and appearance, which is also important. Andrew Kimpton is right that more and more people expect a symmetrical waveform, and associate that look with high-quality products like Soundcloud… even though it’s a lie! ;) I allow views anywhere from several samples to an entire song file, so making all these potential views look good is challenging.

Keep up the good work!


Jamie Lauckner

Over a year ago

Interesting since I wish to understand how to represent visual structures as audio; though I am not familiar with the technology.

