Swimming in OpenCL

by Chris Liscio

Please excuse the vague post, as I don't have anything specific I'd like to share just yet. However, what I'd like to do here is call attention to my new favorite part of Mac OS X 10.6 Snow Leopard—OpenCL.

I've been working on some incredible technology for Capo lately, but it's pretty heavyweight stuff. I'm processing audio data to produce a fancy visualization of its spectral content (not using the FFT). Unfortunately, running this operation is quite slow, so I've been trying to parallelize and optimize it as best I can.

In practice, I don't intend to run the processing on entire audio files, but I've been using that as a worst-case example to test the throughput of a few approaches I've been working on. The input file for the tests below is a 45-second wave file, and the tool I built produces a detailed image file containing a time-vs-frequency view of the entire file.

All the test results were collected on my 8-core Mac Pro. I realize this isn't representative of users' machines, but it allows me to verify that all the system's computing resources are being utilized—it's not trivial to peg 8 cores. And, seeing how this is where computers are heading in the near future, this seems like a smart thing to focus on…

Initial Approach

I implemented the algorithm in question using a MATLAB source file for reference. I used my optimized math routines (which are sped up using vecLib/vDSP stuff), but it still took a while—492 seconds to process the file!
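For the curious, the vecLib/vDSP speedups amount to swapping scalar inner loops for calls into the Accelerate framework. Here's a hypothetical sketch of that style (just the flavor of it, not Capo's actual math):

    #include <Accelerate/Accelerate.h>
    #include <stdlib.h>

    // Hypothetical example of the vecLib/vDSP style: square every sample,
    // then sum the results, using vectorized routines in place of a scalar
    // loop. (Placeholder math, not Capo's actual routines.)
    static float sum_of_squares(const float *samples, vDSP_Length count)
    {
        float *squared = malloc(count * sizeof(float));
        float total = 0.0f;

        vDSP_vsq(samples, 1, squared, 1, count);  // squared[i] = samples[i]^2
        vDSP_sve(squared, 1, &total, count);      // total = sum(squared)

        free(squared);
        return total;
    }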

I fully expected this to be slow, so I wasn't surprised by the result. It operated on only one core, and used up a modest amount of memory. At least I had a baseline in place, and output data to verify the optimized routines against.

(Note: This test was run at a lower resolution than the ones below, as it was unbearably slow as it stood. I think the true timing for this algorithm, with the same analysis parameters for the file, was more on the order of thousands of seconds. Running a test at the same resolution used in the rest of the tests, on a 0.91-second-long input file, resulted in 10 minutes of running time—brutal!)

NSOperationQueue

I know this path very well—NSOperationQueue is employed to speed up waterfall calculations in FuzzMeasure, and it's a very easy API to work with. After some intense optimization work that lasted over a day, I managed to get the process running in 115 seconds.

This was a huge improvement, and I even managed to integrate this algorithm into Capo with a preliminary UI wrapped around it. Unfortunately, I had to really turn down the resolution to get reasonable performance numbers.

There was another interesting side effect: the Capo audio engine would get bogged down while the file was being processed in the background. You see, NSOperationQueue seems to run its operations at normal priority, so they compete with everything else that's happening in your application. You can mitigate the problem by reducing the number of concurrent operations on the queue, but you don't have much room to reduce that load on a dual-core machine.

Also, because of the way my math library is structured, I decided to trade memory for computation time, so I had to do a bunch of work at runtime to balance the memory load (e.g. deciding how much of the audio file gets loaded at once before spawning off lots of copies of its data so it can be read in parallel by all those threads). By the end, my code wasn't very pretty at all.

Finally, the way this code was written, it wasn't easy to support on-demand updates of the parameters used to generate the image of the spectral data. You couldn't, say, expose a user-defined frequency range that gets manipulated in real time, because any parameter change would force everything to be recalculated. These are design issues on my part (a tradeoff for overall algorithm run speed), but I made those decisions consciously.

OpenCL Attempt 1—Scalar Code

OpenCL excites a lot of people, and they seem to go ga-ga over the fact that you can schedule work for your GPU to do. However, it's also an amazingly expressive way to write parallelized code for multi-core CPUs.

I didn't try the OpenCL route until I had become sufficiently frustrated with my NSOperationQueue implementation. I had optimized it as much as I could handle (without severely obfuscating a ton of code, and making the whole implementation very fragile), and I really didn't want to start thinking about making a future release of Capo 10.6-only so soon.

That said, I really wanted to know for sure whether this would offer some kind of benefit over what I'd been doing so far. Heck, maybe I could use the GPU to do my bidding…

Well, I whipped up a Cocoa wrapper for OpenCL (which I hope to share once I add some more features to it), and wrote my first OpenCL kernel in a few hours (with the OpenCL spec at hand). Once I wrapped my head around the whole process, I stepped back and realized that my code was much cleaner and more readable than before.
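To give a flavor of what a first kernel looks like, here's a minimal sketch in the same spirit: one work-item per output value, with a plain scalar inner loop. (The names and the per-sample math are placeholders, not Capo's actual algorithm.)

    // Hypothetical scalar OpenCL kernel: one work-item per output value.
    // The per-sample math is a placeholder, not Capo's actual algorithm.
    __kernel void analyze_scalar(__global const float *input,
                                 __global float *output,
                                 const uint windowSize)
    {
        size_t i = get_global_id(0);    // this work-item's window index
        size_t start = i * windowSize;  // non-overlapping windows, for simplicity

        float acc = 0.0f;
        for (uint n = 0; n < windowSize; n++) {
            float s = input[start + n];
            acc += s * s;               // placeholder per-sample operation
        }
        output[i] = acc;
    }

The appeal is that the outer loop over windows disappears entirely: you enqueue the kernel over an N-element range, and OpenCL decides how to slice the work across cores.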

Still, this was a very naïve implementation, so I wasn't expecting magic out of it. After working out the bugs, I measured 30.87 seconds! Holy cow—that's a huge gain!

At this point, I could have stopped. I had shaved over 70% of the time off my NSOperationQueue implementation, but I wanted to push it a little further, because it still wasn't running all that well on my dual-core 13" MacBook Pro.

I haven't yet integrated this into Capo, as I only just finished writing up the test code (it won't be hard to move over), but what I did notice is that OpenCL schedules these work items on the low-priority Grand Central Dispatch queue. This means that I will be playing very nicely with the rest of the system while this monstrous operation is happening—score one more win for OpenCL!
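Incidentally, targeting the CPU from the host side is just a matter of asking for CL_DEVICE_TYPE_CPU when you create the context and command queue. A bare-bones sketch (error checking omitted for brevity):

    #include <OpenCL/opencl.h>   // ships with Snow Leopard

    // Minimal host-side setup against the CPU device. Every cl* call here
    // can fail; real code should check err and the return values.
    static cl_command_queue make_cpu_queue(cl_context *outContext)
    {
        cl_device_id device;
        cl_int err;

        clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
        *outContext = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        return clCreateCommandQueue(*outContext, device, 0, &err);
    }

Swap in CL_DEVICE_TYPE_GPU and the very same program runs on the graphics card, which is what makes the experiments below so cheap to try.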

OpenCL Attempt 2—Vectorized Code

Intel's CPUs ship with decent vector units these days, and OpenCL lets you write vectorized code very easily. You can cut a loop down to a quarter of its iterations, working on four elements at once, simply by switching to the float4 data type and adjusting the indexes into your data arrays.
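Continuing the hypothetical kernel sketch from above, the float4 version would look something like this (again, placeholder math rather than the real thing):

    // Hypothetical float4 version of the earlier scalar kernel. Each loop
    // iteration now handles four samples at once. Assumes windowSize is a
    // multiple of 4, with the input buffer viewed in float4 units.
    __kernel void analyze_vec4(__global const float4 *input,
                               __global float *output,
                               const uint windowSize)
    {
        size_t i = get_global_id(0);
        size_t vecWindow = windowSize / 4;  // a quarter of the iterations
        size_t start = i * vecWindow;

        float4 acc = (float4)(0.0f);
        for (uint n = 0; n < vecWindow; n++) {
            float4 s = input[start + n];
            acc += s * s;                   // four multiplies and adds at once
        }
        // Folding the four lanes back into one value is only safe because
        // this placeholder accumulation is a plain sum; do it around a
        // non-linear step and you get garbled output.
        output[i] = acc.x + acc.y + acc.z + acc.w;
    }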

This was tricky to get working—maybe 3-4 hours of toying around and debugging the code before I realized I had a mathematical error (I was combining the result of a non-linear operator—oops!) contributing to garbled output. After I got the bugs worked out, I was getting a result of 14.1 seconds.

Absolutely incredible—I basically halved my runtime by working with vectors.

OpenCL Non-Attempt—Running on the GPU

I'm not planning to ship code that runs on the GPU for this particular algorithm. The GPU is a dodgy thing to work with, and I'm dealing with an algorithm that runs far longer than you'd want to tie up the GPU for. For instance, I actually managed to completely lock up my system for a full minute while the algorithm ran.

Oh, that's right—this takes a full minute to execute on a GeForce 8800 GT. The type of algorithm I'm working with is far better suited to the memory layout of a general-purpose computer, its caching strategy, and so on.

Furthermore, there's an issue of overhead here…

OpenCL Overhead

When you work with the CPU, you avoid all the overhead of moving data to and from the GPU. With some extra flags specified (namely CL_MEM_USE_HOST_PTR), you can tell OpenCL that you're supplying your own host memory pointer, and that you wish to avoid the copying step.
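For instance, here's a small sketch of wrapping an already-loaded sample buffer without a copy (assuming the host memory stays alive as long as the buffer does):

    #include <OpenCL/opencl.h>

    // Wrap an existing host-side sample buffer as an OpenCL buffer without
    // copying it, via CL_MEM_USE_HOST_PTR. The samples pointer must remain
    // valid for the lifetime of the returned cl_mem.
    static cl_mem wrap_host_samples(cl_context context, float *samples,
                                    size_t numSamples, cl_int *err)
    {
        return clCreateBuffer(context,
                              CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                              numSamples * sizeof(float),
                              samples,      // used in place; no copy made
                              err);
    }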

In my testing, it takes almost no time to start up an OpenCL context, compile your OpenCL program, and set up your memory and parameters when you use the CPU. On the GPU, I was losing somewhere between 1 and 3 seconds on the round trip.

Conclusion

Overall I'm extremely impressed with what OpenCL brings to the table. It's really not that hard an API to use (especially now that I have a Cocoa wrapper), and if you work at it, you can get some huge speed gains over a more "traditional" multi-core programming approach like NSOperationQueue.

It's not for everyone, for sure, but it's going to make a lot of otherwise complex things easier to do.
