Full Version: lossless codecs and CUDA
gib
With my recent purchase of a 9000 series nVidia graphics card, I started thinking, has anyone investigated if nVidia's CUDA could be useful for lossless compression? I'm not even remotely close to being a programmer, so I haven't a clue how the code works, but it seems like CUDA is valuable for coding/decoding. I know nVidia is already holding a contest to speed up LAME (which ends in about 2 weeks), so perhaps it could be used to speed up lossless compressors? The fastest modes of several codecs are already blazing fast, approaching the limits of hard drives, but perhaps the high-compression modes could be sped-up through CUDA. Maybe, if the speed-up is enough, developers could even implement more ways to gain compression while still maintaining good encoding rates. It would be pretty cool if compression levels like La's best could be done at 50x or something.

Anyway, my curiosity is large, so just thought I'd ask. :)
Martel
I apologize for being completely incorrect.
Garf
QUOTE (Martel @ Jul 13 2008, 10:29) *
If I'm not mistaken, lossless coding usually employs dictionary methods (like LZW/LZMA) which generate a lot of random access and branching operations.


Not at all!

Most lossless audio compressors use large predictive LPC filters. This would be an operation well suited to a GPU, if it weren't for a small detail: because of the need to be LOSSLESS, the operations are often integer, not floating point. It would be possible to do it in floating point too, but then the operations, rounding and precision must be PRECISELY defined. Exactly what GPUs don't have.

Despite all the hype, there aren't that many things GPUs are actually good at.
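
Garf's point about integer prediction can be illustrated with a minimal sketch. It assumes a fixed second-order predictor instead of the adaptive LPC filters real codecs use, and the function names are made up for this example. Because every step is integer arithmetic, the decoder repeats exactly the same prediction and the round trip is bit-exact.

```python
def encode_residual(samples):
    """Integer second-order prediction: pred = 2*x[n-1] - x[n-2]."""
    residual = []
    for n, x in enumerate(samples):
        if n < 2:
            residual.append(x)  # warm-up samples are stored verbatim
        else:
            pred = 2 * samples[n - 1] - samples[n - 2]
            residual.append(x - pred)
    return residual


def decode_residual(residual):
    """Rebuild the samples by repeating the encoder's prediction."""
    samples = []
    for n, r in enumerate(residual):
        if n < 2:
            samples.append(r)
        else:
            pred = 2 * samples[n - 1] - samples[n - 2]
            samples.append(r + pred)
    return samples


pcm = [0, 120, 250, 370, 480, 570, 650, 700]
res = encode_residual(pcm)
assert decode_residual(res) == pcm  # lossless round trip
print(res)
```

The residual values are much smaller than the samples, which is what the entropy coder then exploits. With floating point, encoder and decoder would have to round identically on every machine for the same guarantee to hold.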
gib
Ah, I see now. Thanks very much for the response, Garf.
Gregory S. Chudov
Here is good news.

An alpha version of a FLAC encoder for GPU.

I only tested it on GTS 250, so i'm eager to hear from people with other cards.

As all my applications, this requires .NET framework.

And this time of course a CUDA-enabled graphics card.

Source code as usual on SourceForge.

UPD1: A bit more optimized version re-tuned to not so paranoid compression levels.
UPD2: added pipe encoding for use with fb2k (encoder parameters: -5 - -o %d)
UPD3: seeking problem with pipe encoding in fb2k fixed, lower compression levels speed up.
UPD4: general speed improvement
UPD5: wasted_bits/lossyWav support
UPD6: final optimizations

Download: [attachment]
Older versions: [attachments]
Dr_Colossus
Sounds awesome, care to elaborate on the performance for those of us without a CUDA capable card.
Gregory S. Chudov
Less impressive than I hoped, but this is only an initial version, and GPUs grow faster each day.
On my GTS 250 it's approximately as fast as my C# encoder (which is fast by the way).
FlaCuda -4 achieves the same compression ratio as reference flac -8 (version 1.2.1 on Core 2 Duo @ 3 GHz) at approximately two to three times the speed.
FlaCuda -8 is as slow as flac -8, but gives an extra 0.5% of compression ratio.
Would be nice if someone could thoroughly compare them on a different hardware and post his/her results here.
Grunpfnul
No love for ati? *sniff*
Gregory S. Chudov
There is love, but there's no implementation ^^
But i guess someone else can do it, now that we have a proof-of-concept
Case
I ran some tests with my Core i7 940 (stock speed) and GeForce GTX 285. Original wav file was 237368588 bytes in size. Not too impressive results:
FLAC -5 : Elapsed Time : 00:00:08.268 (181929373 bytes)
FLAC -8 : Elapsed Time : 00:00:30.560 (181788832 bytes)
FlaCuda -4 : Elapsed Time : 00:00:09.204 (181892106 bytes)
FlaCuda -5 : Elapsed Time : 00:00:10.904 (181763725 bytes)
FlaCuda -8 : Elapsed Time : 00:00:12.370 (181676614 bytes)
FlaCuda -11: Elapsed Time : 00:00:23.883 (181734405 bytes)
Gregory S. Chudov
Thank you!
Ron Jones
I'm anxious to see how this would perform on the next generation of NVIDIA hardware (GT300), which is supposedly significantly faster in general computational performance than the previous architecture (G200).

Very exciting -- thank you!
thundat00th
QUOTE (Grunpfnul @ Sep 10 2009, 03:24) *
No love for ati? *sniff*

as much as i love ati i wish they had as much support for things as nvidia does (they need to get to work on that hardware havok physics)

maybe the "evergreen" release here in a bit will improve things (i hope)

as far as this goes, i would be interested in lossy gpu encoding, and that might work a bit better regarding the inaccurate floating point calculations

ati stream support pwease?
hlloyge
Here are my test results:

Klaus Schulze - Dreams Deluxe Edition, size 797 MB
Core2Duo 8200, Geforce 9600GT with passive cooling

Encoding with FLAC 1.2.1 in command line, -6, version from Sourceforge, 38 seconds

And this...

PS D:\temp_2> .\CUETools.FlaCuda.exe -6 '.\Klaus Schulze - Dreams Deluxe Edition.wav'
CUETools.FlaCuda, Copyright © 2009 Gregory S. Chudov.
This is free software under the GNU GPLv3+ license; There is NO WARRANTY, to
the extent permitted by law. <http://www.gnu.org/licenses/> for details.
Filename : .\Klaus Schulze - Dreams Deluxe Edition.wav
File Info : 44100kHz; 2 channel; 16 bit; 01:19:00.8800000
Results : 61,11x; 499280528 bytes in 00:01:17.5764372 seconds;

Windows 7 32 bit.

Well... not that impressive

(edit) wrote 10 seconds too much for flac encode...
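
The "61,11x" in the Results line appears to be the audio length divided by the encoding time, i.e. speed relative to realtime. A small sketch that reproduces it from the two timestamps in the output above (the helper names are hypothetical, not part of FlaCuda):

```python
def to_seconds(hms):
    """Convert 'hh:mm:ss.fraction' to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def realtime_factor(audio_length, elapsed):
    """How many seconds of audio were encoded per second of wall time."""
    return to_seconds(audio_length) / to_seconds(elapsed)


# Values copied from the "File Info" and "Results" lines above
factor = realtime_factor("01:19:00.8800000", "00:01:17.5764372")
print(f"{factor:.2f}x")  # 61.11x
```

The same calculation makes runs on files of different length comparable, which raw elapsed times are not.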
Gregory S. Chudov
What was the file size for flac -6? We should compare the speed at the same compression ratio, i.e. output file size, not at the same compression level, because e.g. -6 for flac is much lower compression than -6 for flacuda. Please, try to compare flacuda -5 vs flac -8, and compare both execution times and file sizes.

Here's a graph i made of Case's results:


This shows x3 speedup of flac -8 compression.
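
The comparison at equal output size can also be worked out directly from the numbers Case posted. This sketch picks the fastest FlaCuda setting whose file is no larger than flac -8 and reports the speedup; the variable names are only for illustration.

```python
# (label, elapsed seconds, output bytes) from Case's post
flac_8 = ("FLAC -8", 30.560, 181788832)
flacuda = [
    ("FlaCuda -4", 9.204, 181892106),
    ("FlaCuda -5", 10.904, 181763725),
    ("FlaCuda -8", 12.370, 181676614),
    ("FlaCuda -11", 23.883, 181734405),
]

# Keep only settings that compress at least as well as flac -8,
# then take the one with the shortest encoding time.
matching = [run for run in flacuda if run[2] <= flac_8[2]]
best = min(matching, key=lambda run: run[1])
speedup = flac_8[1] / best[1]
print(f"{best[0]}: {speedup:.2f}x faster than {flac_8[0]}")
```

On this data FlaCuda -4 misses the flac -8 size by about 100 KB, so -5 is the first setting that qualifies, at roughly 2.8 times the speed.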
Wombat
Not too shabby. Tried it on a C2D@3600+GTX260

Dream Theater, Awake

Original 793.976.444 Bytes
Flac 1.2.1 -8 568.604.561 Bytes ~94 sec. encoding time
Flaccuda -8 567.956.198 Bytes ~53 sec.

I don't have a recent Flake version at hand so i don't know how much comes from Cuda alone.

Edit:
Flaccuda -6 568.280.716 Bytes ~48 sec.
GHammer
This is on a 9500 GT

FlaCuda
Filename : Clocks.wav
File Info : 44100kHz; 2 channel; 16 bit; 00:05:07.4670000
Results : 43.10x; 35657424 bytes in 00:00:07.1331000 seconds;

Flac 1.2.1
Clocks.wav: wrote 35796074 bytes, ratio=0.660
2.91 seconds

Both were just run as <executable> Clocks.wav
Gregory S. Chudov
QUOTE (GHammer @ Sep 11 2009, 05:21) *
Flac 1.2.1
Clocks.wav: wrote 35796074 bytes, ratio=0.660
2.91 seconds

That's a bit too small file for comparison. And it's better to compare against flac -8. Default flac compression level is very fast, i don't think it can be beaten by FlaCuda, at least yet. FlaCuda is focusing on higher compression.
Lucho
GPU audio encoding will become useful once OpenCL is adopted by both ATI and Nvidia; for now it is just a "proof of concept".
hlloyge
Here I am again, this time, more detailed:

Flac 1.2.1 vs Cuda 01

File: album.wav 643566044

Windows 7, C2Q9400 @ 2.66 GHz, Geforce 9500 GS

flac -8: wrote 405957413 bytes, ratio=0,631 in 99 seconds
cuda -8: 34,98x; 405731414 bytes in 00:01:44.2910429 seconds;

Is there a multicore flac encoder? That would be a nice thing to test...
Justin Ruggles
QUOTE (hlloyge @ Sep 11 2009, 14:52) *
Is there a multicore flac encoder? That would be a nice thing to test...

http://softlab-pro-web.technion.ac.il/Proj.../downloads.html

I haven't tested this personally or done anything about trying to adapt the code for inclusion in Flake.
gib
Hey, wow. This topic of mine was bumped, and with proof of concept software to boot. Thank you, Gregory!

Here are my results to add to the data (I used flac 1.2.1 -8 and Flacuda01 -8 as suggested):

CPU: Athlon X2 @ 2.35 GHz
GPU: 9600 GSO @ 600 MHz

File 1: 656647868 bytes
Flac: 466183490 in 148 seconds
cuda: 465898530 in 65 seconds

File 2: 654389948 bytes
Flac: 362792762 in 145 seconds
cuda: 360670158 in 63 seconds

More than 2x faster and better compression too. That's pretty impressive.
PatchWorKs
Well, I believe that even a small gain is always welcome.

I'm not a developer, so I dunno if possible, but: what about a liboil-like library but for GPGPU encodings, so *any* codec could benefit from GPU computations ?
hlloyge
Again: C2D8200, Geforce 9600GT

album.wav to flac -8

original: 578046380
flac: 344489508 in 80 seconds
cuda: 344226134 bytes in 00:00:52.8150209 seconds

Nice.
Gregory S. Chudov
QUOTE (PatchWorKs @ Sep 12 2009, 12:40) *
I'm not a developer, so I dunno if possible, but: what about a liboil-like library but for GPGPU encodings, so *any* codec could benefit from GPU computations ?

Not sure. The code i wrote is quite codec specific. The catch is in a relatively slow connection between CPU and GPU. I had to implement practically the whole FLAC algorithm on the device, so that i won't have to transfer intermediate values between host and GPU, only the final result.

FLAC turned out to be very convenient for GPU. Probably the most convenient. One look at e.g. ALAC algorithm was enough to understand it can never get the same benefit.

QUOTE (hlloyge @ Sep 12 2009, 13:32) *
original: 578046380
flac: 344489508 in 80 seconds
cuda: 344226134 bytes in 00:00:52.8150209 seconds

Nice.

Thank you. And how about FlaCuda -5? It should provide enough compression to beat flac -8.
Dologan
Wow, nice work Gregory!

Just wondering... how did you get around the limitation mentioned by Garf earlier on this thread about GPUs only doing floating point and therefore not being suitable for lossless encoding?
Gregory S. Chudov
Current GPUs do integer computations quite alright.
Dologan
Hmm, does the encoder do pipe encoding (i.e. for proper foobar2000 use)?
Wombat
Some questions regarding Flaccuda.
Back when flake was new i had problems encoding at higher compression than standard Flac and playback on my Slimdevice.
Does Flaccuda use the same options at the corresponding compression level of Flac? At least it looks like i can play back Flaccuda -8 on my Slimdevice. How come it compresses better, then?
Shouldn't it be named "FlakeCuda" in the end?
Maurits
How hard would it be to convert this CUDA version into a more versatile OpenCL implementation? It is said that OpenCL is largely based on CUDA but non vendor-specific. That suggests it should be easy to adapt.

That way it wouldn't be limited to NVIDIA GPUs. In fact, it would even remove the limit of just using a GPU as OpenCL can combine all available GPUs and processor cores in the system as if it was one unit.
Gregory S. Chudov
QUOTE (Dologan @ Sep 12 2009, 15:15) *
Hmm, does the encoder do pipe encoding (i.e. for proper foobar2000 use)?

It does now (version 02), but i would suggest to be careful with your precious files while this is still an alpha version.

QUOTE (Wombat @ Sep 12 2009, 16:41) *
Some questions regarding Flaccuda.
Back when flake was new i had problems encoding at higher compression than standard Flac and playback on my Slimdevice.
Does Flaccuda use the same options at the corresponding compression level of Flac? At least it looks like i can play back Flaccuda -8 on my Slimdevice. How come it compresses better, then?
Shouldn't it be named "FlakeCuda" in the end?

It doesn't use the same options at the corresponding compression levels. But it does stick to a so called FLAC subset (supported by hardware devices) for compression levels 0-8. Compression levels 9-11 are non-subset, and might not play on some devices. Flake has the same conventions.

Better compression is achieved mainly by brute-force search of optimal compression parameters (stereo modes, LPC orders, and window functions). Flac does this only at level 8, and it only tries one window function, and not the best one.

As grateful as I am to Justin for his wonderful Flake encoder, FlaCuda, unlike my C# Flake port, is not a derivative work. Flake's algorithm was written for CPU, not GPU, and those are two very different realms. Flake does a great job at smart guessing the best compression parameters, while FlaCuda just makes a brute-force search on a GPU. FlaCuda however contains a C# Flake library, and uses it for FLAC decompression, if source file is flac, or if --verify mode is enabled.
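
The brute-force idea can be sketched on the CPU in a few lines. This is not FlaCuda's code: it assumes FLAC's fixed predictors (orders 0 to 4) in place of LPC orders and window functions, and uses the sum of absolute residuals as a rough proxy for encoded size. All names are hypothetical.

```python
from itertools import product

# FLAC's fixed predictors, as coefficients on x[n-1], x[n-2], ...
FIXED = {
    0: [],
    1: [1],
    2: [2, -1],
    3: [3, -3, 1],
    4: [4, -6, 4, -1],
}


def residual_cost(signal, order):
    """Sum of |residual|, a rough stand-in for the encoded size."""
    coeffs = FIXED[order]
    cost = sum(abs(x) for x in signal[:order])  # warm-up samples
    for n in range(order, len(signal)):
        pred = sum(c * signal[n - 1 - k] for k, c in enumerate(coeffs))
        cost += abs(signal[n] - pred)
    return cost


def stereo_candidates(left, right):
    """The four stereo modes FLAC can choose between per frame."""
    side = [l - r for l, r in zip(left, right)]
    mid = [(l + r) >> 1 for l, r in zip(left, right)]
    return {
        "left/right": (left, right),
        "left/side": (left, side),
        "right/side": (right, side),
        "mid/side": (mid, side),
    }


def brute_force(left, right):
    """Try every stereo mode with every predictor order per channel."""
    best = None
    for mode, (ch0, ch1) in stereo_candidates(left, right).items():
        for o0, o1 in product(FIXED, repeat=2):
            cost = residual_cost(ch0, o0) + residual_cost(ch1, o1)
            if best is None or cost < best[0]:
                best = (cost, mode, o0, o1)
    return best  # (cost, stereo mode, order for ch0, order for ch1)


left = [10, 20, 30, 40, 50, 60, 70, 80]
right = [12, 21, 33, 41, 52, 61, 73, 81]
print(brute_force(left, right))
```

Every combination is evaluated independently of the others, which is why this kind of search maps well onto many GPU threads, whereas a CPU encoder such as Flake saves time by guessing which combinations are worth trying.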

QUOTE (Maurits @ Sep 12 2009, 17:48) *
How hard would it be to convert this Cuda version into a more versatile OpenCL implementation?

That way it wouldn't be limited to NVIDIA GPUs. In fact, it would even remove the limit of just using a GPU as OpenCL can combine all available GPUs and processor cores in the system as if it was one unit.

I'm not yet experienced enough in this matter, but i assume that this versatility will come at the price of speed. I will try to verify this later.
Wombat
Thanks for explaining it. Really a nice work you have done, thanks for that. Now i know what i can use Cuda for, it should really be mentioned on the nvidia Cuda pages.
Maurits
QUOTE (Gregory S. Chudov @ Sep 12 2009, 15:07) *
QUOTE (Maurits @ Sep 12 2009, 17:48) *
How hard would it be to convert this Cuda version into a more versatile OpenCL implementation?

That way it wouldn't be limited to NVIDIA GPUs. In fact, it would even remove the limit of just using a GPU as OpenCL can combine all available GPUs and processor cores in the system as if it was one unit.

I'm not yet experienced enough in this matter, but i assume that this versatility will come at the price of speed. I will try to verify this later.


That's possible. Although a performance hit might be offset by the fact that OpenCL combines the CPU and all available GPUs. The biggest difference I seem to find after some research is that there are a couple of things implemented in CUDA that OpenCL doesn't have yet. However, if you don't use any of these additional features for your specific implementation that wouldn't matter.

I must say that I am only speculating, I don't know much about this matter either, I was just wondering...
Dologan
QUOTE (Gregory S. Chudov @ Sep 12 2009, 15:07) *
QUOTE (Dologan @ Sep 12 2009, 15:15) *
Hmm, does the encoder do pipe encoding (i.e. for proper foobar2000 use)?

It does now (version 02), but i would suggest to be careful with your precious files while this is still an alpha version.

Wow, thanks! Pipe encoding seems to be working, with no differences in the decoded data of the resulting file. Speed for -8 is ~70x on my 8800GT vs ~40x on a single core of my Q6600. However, the resulting file appears to be lacking any length and bitrate information and so seeking is impossible.

Also, obviously foobar2000 isn't ready to properly handle GPU encoding. With the converter set to handle three simultaneous encoding processes for my quad core, FlaCuda actually slows down to around ~35x overall, whereas the standard Flac naturally scales well to ~110x

So yeah, not quite flac-replacement ready, then. Looks promising for inherently single thread things like rip+encodes, though (once/if it gets tagging arguments implemented, that is)
guruboolez
My results on an old Core2Duo E6300 and a small Nvidia GeForce 9400 GT. I took two discs: one solo piano disc that compresses very well (<400 kbps) and a baroque orchestral work that doesn't (750 kbps).


PIANO MUSIC

CODE
WAV        594.191 KB
FLAC -5    163.122 KB    49313 milliseconds    x69.94
FLAC -8    159.276 KB   116641 milliseconds    x29.57
CUDA -0    158.750 KB    60188 milliseconds    x57.30
CUDA -4    158.024 KB    88531 milliseconds    x38.96
CUDA -8    156.881 KB   176656 milliseconds    x19.52
CUDA 11    156.799 KB   527922 milliseconds     x6.53



VIVALDI

CODE
WAV        754.037 KB
FLAC -5    393.834 KB    68047 milliseconds    x64.32
FLAC -8    393.279 KB   160109 milliseconds    x27.33
CUDA -0    394.796 KB    78688 milliseconds    x55.62
CUDA -4    394.034 KB   111469 milliseconds    x39.26
CUDA -8    393.191 KB   223328 milliseconds    x19.59
CUDA 11    392.079 KB   675656 milliseconds     x6.47


On this cheap GPU, FlaCuda 0.2 performs rather well. It can't match the CPU's speed, but at -0 it comes close and sometimes compresses better than flac.exe -8! Bear in mind that the CPU has two cores and only one was used for this benchmark.
If I'm not wrong, a similar 9400 GPU is used in the ION system. That would mean cheap, low-powered nettops or netbooks with an ION chipset could well be used for batch flac encoding. To be confirmed...

SMALL DECODING SPEED:

CODE
FLAC -8:   x409
CUDA -8:   x392
CUDA 11:   x285


As you can see there's a drastic fall in decoding speed with flacuda -11 (tested with latest foobar2000). On my Sansa Clip (2GB) the playback seems to be fine (I just tried one file though).


More tests are needed but it looks like a very interesting encoder which should work nicely on an ION chipset.
Garf
QUOTE (Gregory S. Chudov @ Sep 12 2009, 12:52) *
Current GPUs do integer computations quite alright.


It used to be so, on the Nvidia side, that you can only do 24 bit arithmetic, which might be enough for FLAC. I don't know about ATI. 32-bits (i.e. normal) arithmetic is only possible with a huge performance penalty.

New versions of CUDA or the cards might have changed this, or FLAC might have been simple enough that it wasn't an issue.

PS. Are these posts comparing multithreaded FLAC implementations on the host? (I don't know if those exist)
arri
Just finished my tests:

image.wav:
flac 1.2.1   -8 : 52 sec
flac-cuda  -8 : 32 sec

image.wav divided in 10 songs (1.wav, 2.wav etc.)
flac 1.2.1  -8 : 52 sec
flac-cuda  -8 : 32 sec

flac 1.2.1-icl : 30 sec

flac 1.2.1-icl is operating on both cores on my processor.
Intel Core 2 Duo E8500; Nvidia 8800 GT

flac 1.2.1-icl I found sometime ago somewhere in hydrogenaudio
Wombat
QUOTE (arri @ Sep 12 2009, 22:16) *
Just finished my tests:

image.wav:
flac 1.2.1   -8 : 52 sec
flac-cuda  -8 : 32 sec

image.wav divided in 10 songs (1.wav, 2.wav etc.)
flac 1.2.1  -8 : 52 sec
flac-cuda  -8 : 32 sec

flac 1.2.1-icl : 30 sec

flac 1.2.1-icl is operating on both cores on my processor.
Intel Core 2 Duo E8500; Nvidia 8800 GT

flac 1.2.1-icl I found sometime ago somewhere in hydrogenaudio


Afaik there isn't a good Multi-Core version and i can't believe a different compile can speed up by 75%. Please upload this version somewhere or link to its source.
arri
QUOTE (Wombat @ Sep 12 2009, 23:51) *
Afaik there isn't a good Multi-Core version and i can't believe a different compile can speed up by 75%. Please upload this version somewhere or link to its source.


I think those different flac encoders I have came from rarewares

Wombat
QUOTE (arri @ Sep 12 2009, 23:40) *
QUOTE (Wombat @ Sep 12 2009, 23:51) *
Afaik there isn't a good Multi-Core version and i can't believe a different compile can speed up by 75%. Please upload this version somewhere or link to its source.


I think those different flac encoders I have came from rarewares

It can't be that compile, and please don't waste my time with trying versions you link to because you "think" it may be the one.
Case
I made a more thorough comparison with the new version. I combined a wav from 18 different genres giving hopefully a better representation of real abilities. This compares each compression mode. Horizontal scale is compression ratio and vertical scale is encoding speed vs realtime. With this test set CUDA version was more efficient starting from compression mode 6 but then only faster than FLAC's modes 7 and 8.
[attachment]
hlloyge
It sure isn't that compile, as they (at least for me) run at the same speed for -8.
alvaro84
I've done a quick test of how a 2+ year old full-fledged mainstream CPU (to be more precise: one core of it) stands against a pretty cheap, a little better than low-end GPU of its own era, both overclocked. The Core2 Duo (E6420, Conroe core) runs at 3328 MHz with ddr2-832 cl4; the 8600GT runs at 580/1296/837MHz, this is all it can do with passive cooling (probably at a decreased core voltage).

CPU: 49.8x (3328/416MHz)
GPU: 69.4x (580/1296/837MHz)
GPU: 66.4x (540/1188/702MHz)

lv6:
GPU: 54.3x (580/1296/837MHz)

I've tested both -5 and -6 because for my test material file size with FLAC 1.2.1 -8 fell right between FLACuda -5 and -6.
Decoding speed (performed by fb2k):
1.1.2 -8: 615x
CUDA -5: 618x
CUDA -6: 572x
(FLACUDA -11 encoded much slower, ~12x; and it also decoded slower, ~300x)

Considering how insane performance (and extremely power hogging) GPUs are around these days, a GPU FLAC encoder seems a good idea.

I just found one glitch: the decoded voice data seems identical but the FLAC/Cuda files are not seekable in my fb2k 0.9.6.9. The parameters were -6 - -o %d
(OK, I see, I'm not alone with this problem)

[p.s. I also made a comparison with TAK -p2m, which I regularly use: 77.7x encoding by one CPU core, 3.5% smaller (968 vs 1002kbps) and decodes at 384x speed - definitely slower than FLAC, except extreme FLACuda files]
Gregory S. Chudov
Thank you for detailed test results. Looking at them i decided to focus on optimizing performance at lower compression levels. Version 03 must be noticeably faster at levels 0..7. I also fixed the problem with files being unseekable when using pipe encoding from fb2k.
alvaro84
QUOTE (Gregory S. Chudov @ Sep 13 2009, 16:47) *
Thank you for detailed test results. Looking at them i decided to focus on optimizing performance at lower compression levels. Version 03 must be noticeably faster at levels 0..7. I also fixed the problem with files being unseekable when using pipe encoding from fb2k.


The good #1: the resulting FLAC is seekable!
The good #2: -6 is definitely faster, 60.4x vs 54.3x
The bad: the files are slightly larger, now I need -7 to get smaller result than Flac 1.1.2 -8 (CPU -8: 37810k; CUDA -6: 37857k; CUDA -7: 37791k)
The ugly: FLACuda -7 is slower than CPU FLAC -8. On my 'nose heavy' system, that is.

Hm, I probably should try with different tracks (my ad-hoc test sample is a ZUN theme from the Changeability of Strange Dream album, strictly speaking it's not a Touhou soundtrack, but similar to the game background music).
Is it the seek table that makes -6 files larger?

update: In case of a Rammstein track GPU -6 got smaller than CPU -8. Need more samples to test.
(Sorry, I was a bit hasty to post about it. Human error)
Gregory S. Chudov
QUOTE (alvaro84 @ Sep 13 2009, 22:33) *
Is it the seek table that makes -6 files larger?

Nope. Old-style -6 can be invoked by parameters "-5 -l 12". That's a lower-case L there, not a digit 1.
Case
Seems to me like other modes got a speed boost too:
[attachment]
Gregory S. Chudov
Phew. I think i finally squeezed everything i could out of it, at least for now.

Version 04 should be faster than anything.
Case
Impressive.
[attachment]
Gregory S. Chudov
Thank you.