FLACCL: CUDA-enabled FLAC encoder by Gregory S. Chudov (prev. FlaCuda), Formerly "lossless codecs and CUDA" |
Sep 12 2009, 11:40
Post
#26
|
|
![]() Group: Members (Donating) Posts: 478 Joined: 22-November 01 From: United Kingdom Member No.: 519 |
Wow, nice work Gregory!
Just wondering... how did you get around the limitation mentioned by Garf earlier in this thread, that GPUs only do floating point and are therefore not suitable for lossless encoding?
|
|
|
Sep 12 2009, 11:52
Post
#27
|
|
![]() Group: Developer Posts: 648 Joined: 2-October 08 From: Ottawa Member No.: 59035 |
Current GPUs do integer computations quite alright.
-------------------- CUETools 2.1.4
|
|
|
|
Sep 12 2009, 12:15
Post
#28
|
|
![]() Group: Members (Donating) Posts: 478 Joined: 22-November 01 From: United Kingdom Member No.: 519 |
Hmm, does the encoder do pipe encoding (i.e. for proper foobar2000 use)?
|
|
|
|
Sep 12 2009, 13:41
Post
#29
|
|
![]() Group: Members Posts: 840 Joined: 7-October 01 Member No.: 235 |
Some questions regarding Flaccuda.
Back when flake was new I had problems encoding at higher compression than standard Flac and playing back on my Slimdevice. Does Flaccuda use the same options at the corresponding compression level of Flac? At least it looks like I can play back Flaccuda -8 on my Slimdevice. How come it compresses better, then? Shouldn't it be named "FlakeCuda" in the end?
|
|
|
Sep 12 2009, 14:48
Post
#30
|
|
|
Group: Members Posts: 326 Joined: 30-September 05 From: London, Europe Member No.: 24805 |
How hard would it be to convert this CUDA version into a more versatile OpenCL implementation? It is said that OpenCL is largely based on CUDA but non vendor-specific. That suggests it should be easy to adapt.
That way it wouldn't be limited to NVIDIA GPUs. In fact, it would even remove the limit of just using a GPU as OpenCL can combine all available GPUs and processor cores in the system as if it was one unit.
This post has been edited by Maurits: Sep 12 2009, 14:59
|
|
|
Sep 12 2009, 15:07
Post
#31
|
|
![]() Group: Developer Posts: 648 Joined: 2-October 08 From: Ottawa Member No.: 59035 |
QUOTE: Hmm, does the encoder do pipe encoding (i.e. for proper foobar2000 use)?
It does now (version 02), but I would suggest being careful with your precious files while this is still an alpha version.
QUOTE: Some questions regarding Flaccuda. Back when flake was new I had problems encoding at higher compression than standard Flac and playing back on my Slimdevice. Does Flaccuda use the same options at the corresponding compression level of Flac? At least it looks like I can play back Flaccuda -8 on my Slimdevice. How come it compresses better, then? Shouldn't it be named "FlakeCuda" in the end?
It doesn't use the same options at the corresponding compression levels. But it does stick to the so-called FLAC subset (supported by hardware devices) for compression levels 0-8. Compression levels 9-11 are non-subset, and might not play on some devices. Flake has the same conventions.
Better compression is achieved mainly by a brute-force search for the optimal compression parameters (stereo modes, LPC orders, and window functions). Flac does this only at level 8, and it only tries one window function, and not the best one.
As grateful as I am to Justin for his wonderful Flake encoder, FlaCuda, unlike my C# Flake port, is not a derivative work. Flake's algorithm was written for CPU, not GPU, and those are two very different realms. Flake does a great job of smartly guessing the best compression parameters, while FlaCuda just does a brute-force search on a GPU. FlaCuda does however contain a C# Flake library, and uses it for FLAC decompression if the source file is flac, or if --verify mode is enabled.
QUOTE: How hard would it be to convert this Cuda version into a more versatile OpenCL implementation? That way it wouldn't be limited to NVIDIA GPUs. In fact, it would even remove the limit of just using a GPU as OpenCL can combine all available GPUs and processor cores in the system as if it was one unit.
I'm not yet experienced enough in this matter, but I assume that this versatility will come at the price of speed. I will try to verify this later.
This post has been edited by Gregory S. Chudov: Sep 12 2009, 15:24
-------------------- CUETools 2.1.4
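To make the "brute-force search" idea above concrete, here is a minimal Python sketch. It is not FlaCuda's code: instead of windowed LPC it uses FLAC's simpler fixed polynomial predictors (orders 0 to 4) as a stand-in, and the sum of absolute residuals as a crude stand-in for the encoded size. The function names (`fixed_residual`, `estimate_cost`, `brute_force_search`) are made up for illustration. The point is only that every (stereo mode, order, order) combination is scored independently, which is the kind of work that spreads well across many GPU threads.

```python
from itertools import product

MAX_ORDER = 4  # FLAC's fixed predictors go from order 0 to order 4


def fixed_residual(samples, order):
    """Residual of a fixed polynomial predictor: the order-th finite difference."""
    res = list(samples)
    for _ in range(order):
        res = [b - a for a, b in zip(res, res[1:])]
    return res


def estimate_cost(samples, order):
    """Crude size proxy (assumption): warm-up samples plus residual magnitudes."""
    warmup = samples[:order]
    residual = fixed_residual(samples, order)
    return sum(abs(s) for s in warmup) + sum(abs(r) for r in residual)


def stereo_candidates(left, right):
    """The four stereo decorrelation modes a FLAC frame can use."""
    side = [l - r for l, r in zip(left, right)]
    mid = [(l + r) >> 1 for l, r in zip(left, right)]
    return {
        "left/right": (left, right),
        "left/side": (left, side),
        "right/side": (side, right),
        "mid/side": (mid, side),
    }


def brute_force_search(left, right):
    """Score every (mode, order, order) combination and keep the cheapest one."""
    candidates = stereo_candidates(left, right)
    orders = range(MAX_ORDER + 1)
    best = None
    for mode, order0, order1 in product(candidates, orders, orders):
        ch0, ch1 = candidates[mode]
        cost = estimate_cost(ch0, order0) + estimate_cost(ch1, order1)
        if best is None or cost < best[0]:
            best = (cost, mode, order0, order1)
    return best


if __name__ == "__main__":
    # Toy block: both channels are ramps that differ by a constant offset.
    left = [3 * i for i in range(64)]
    right = [3 * i + 5 for i in range(64)]
    print(brute_force_search(left, right))
```

A real encoder would replace `estimate_cost` with an actual bit count and add the LPC orders and window functions mentioned in the post to the search space; the loop structure stays the same.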
|
|
|
|
Sep 12 2009, 15:39
Post
#32
|
|
![]() Group: Members Posts: 840 Joined: 7-October 01 Member No.: 235 |
Thanks for explaining it. Really nice work you have done, thanks for that. Now I know what I can use Cuda for; it should really be mentioned on the nvidia Cuda pages.
|
|
|
|
Sep 12 2009, 15:57
Post
#33
|
|
|
Group: Members Posts: 326 Joined: 30-September 05 From: London, Europe Member No.: 24805 |
QUOTE: How hard would it be to convert this Cuda version into a more versatile OpenCL implementation? That way it wouldn't be limited to NVIDIA GPUs. In fact, it would even remove the limit of just using a GPU as OpenCL can combine all available GPUs and processor cores in the system as if it was one unit.
QUOTE: I'm not yet experienced enough in this matter, but i assume that this versatility will come for a price of speed. I will try to verify this later.
That's possible. Although a performance hit might be offset by the fact that OpenCL combines the CPU and all available GPUs.
The biggest difference I seem to find after some research is that there are a couple of things implemented in CUDA that OpenCL doesn't have yet. However, if you don't use any of these additional features for your specific implementation that wouldn't matter.
I must say that I am only speculating, I don't know much about this matter either, I was just wondering...
|
|
|
Sep 12 2009, 17:40
Post
#34
|
|
![]() Group: Members (Donating) Posts: 478 Joined: 22-November 01 From: United Kingdom Member No.: 519 |
QUOTE: Hmm, does the encoder do pipe encoding (i.e. for proper foobar2000 use)?
QUOTE: It does now (version 02), but i would suggest to be careful with your precious files while this is still an alpha version.
Wow, thanks! Pipe encoding seems to be working, with no differences in the decoded data of the resulting file. Speed for -8 is ~70x on my 8800GT vs ~40x on a single core of my Q6600. However, the resulting file appears to be lacking any length and bitrate information and so seeking is impossible.
Also, obviously foobar2000 isn't ready to properly handle GPU encoding. With the converter set to handle three simultaneous encoding processes for my quad core, FlaCuda actually slows down to around ~35x overall, whereas the standard Flac naturally scales well to ~110x.
So yeah, not quite flac-replacement ready, then.
This post has been edited by Dologan: Sep 12 2009, 18:02
|
|
|
Sep 12 2009, 20:09
Post
#35
|
|
![]() Group: Members (Donating) Posts: 3474 Joined: 7-November 01 From: Strasbourg (France) Member No.: 420 |
My results on an old Core2Duo E6300 and a small Nvidia GeForce 9400 GT. I took two discs: one solo piano disc that compresses very well (<400 kbps) and a baroque orchestral work that doesn't (750 kbps).

PIANO MUSIC
CODE
WAV       594.191 KB
FLAC -5   163.122 KB    49313 milliseconds   x69.94
FLAC -8   159.276 KB   116641 milliseconds   x29.57
CUDA -0   158.750 KB    60188 milliseconds   x57.30
CUDA -4   158.024 KB    88531 milliseconds   x38.96
CUDA -8   156.881 KB   176656 milliseconds   x19.52
CUDA 11   156.799 KB   527922 milliseconds   x6.53

VIVALDI
CODE
WAV       754.037 KB
FLAC -5   393.834 KB    68047 milliseconds   x64.32
FLAC -8   393.279 KB   160109 milliseconds   x27.33
CUDA -0   394.796 KB    78688 milliseconds   x55.62
CUDA -4   394.034 KB   111469 milliseconds   x39.26
CUDA -8   393.191 KB   223328 milliseconds   x19.59
CUDA 11   392.079 KB   675656 milliseconds   x6.47

On this cheap GPU, FlaCuda 0.2 performs rather well. It can't be as fast as the CPU, but this encoder can approach that speed at -0 and sometimes compresses better than flac.exe -8! Nevertheless the CPU has two cores and only one was used for this benchmark.
If I'm not wrong, a similar 9400 GPU is used in the ION system. It means that cheap, low-powered nettops or netbooks with an ION chipset could perfectly well be used for batch flac encoding. To be confirmed...

SMALL DECODING SPEED:
CODE
FLAC -8: x409
CUDA -8: x392
CUDA 11: x285

As you can see there's a drastic fall in decoding speed with flacuda -11 (tested with the latest foobar2000). On my Sansa Clip (2GB) the playback seems to be fine (I just tried one file though).
More tests are needed but it looks like a very interesting encoder which should work nicely on an ION chipset.
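For readers wondering how the "x" speed figures relate to the sizes and times in the tables, here is a small worked calculation in Python. It rests on assumptions that the post does not state: that the source is CD audio (44.1 kHz, 16-bit, stereo), that "594.191 KB" uses a dot as a thousands separator (so 594191 KB), and that a KB here is 1024 bytes. The helper names are invented for this sketch.

```python
# Assumption: CD audio, 44100 samples/s * 2 channels * 2 bytes per sample.
BYTES_PER_SECOND = 44100 * 2 * 2


def duration_seconds(wav_kib):
    """Playing time of a WAV file, ignoring the small file header."""
    return wav_kib * 1024 / BYTES_PER_SECOND


def speed_factor(wav_kib, encode_ms):
    """Encoding speed relative to real time (the 'x' column in the tables)."""
    return duration_seconds(wav_kib) / (encode_ms / 1000)


def bitrate_kbps(flac_kib, wav_kib):
    """Average bitrate of the encoded file in kilobits per second."""
    return flac_kib * 1024 * 8 / duration_seconds(wav_kib) / 1000


if __name__ == "__main__":
    # Piano disc, FLAC -5 row: 594191 KB WAV, 163122 KB FLAC, 49313 ms.
    print(round(speed_factor(594191, 49313), 2))
    print(round(bitrate_kbps(163122, 594191)))
```

Under those assumptions the piano FLAC -5 row lands within rounding distance of the x69.94 quoted in the table, and the resulting bitrate comes out below the 400 kbps mentioned in the text, which suggests the assumptions are reasonable.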
|
|
|
Sep 12 2009, 21:35
Post
#36
|
|
![]() Server Admin Group: Admin Posts: 4808 Joined: 24-September 01 Member No.: 13 |
QUOTE: Current GPUs do integer computations quite alright.
It used to be the case, on the Nvidia side, that you could only do 24-bit arithmetic, which might be enough for FLAC. I don't know about ATI. 32-bit (i.e. normal) arithmetic is only possible with a huge performance penalty. New versions of CUDA or the cards might have changed this, or FLAC might have been simple enough that it wasn't an issue.
PS. Are these posts comparing multithreaded FLAC implementations on the host? (I don't know if those exist)
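As a rough illustration of why 24-bit multiplies "might be enough for FLAC", the Python sketch below emulates a 24-bit integer multiply in the style of CUDA's `__mul24` intrinsic (low 24 bits of each operand in, low 32 bits of the product out). Whether FlaCuda actually uses such multiplies is not stated in the thread; the 16-bit sample and 15-bit coefficient in the example are assumptions chosen for illustration, and the helper names are invented.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a signed two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def mul24(a, b):
    """Emulated 24-bit multiply: only the low 24 bits of each operand are used,
    and only the low 32 bits of the product are kept."""
    return sign_extend(sign_extend(a, 24) * sign_extend(b, 24), 32)


if __name__ == "__main__":
    sample = -32768       # most negative 16-bit audio sample
    coefficient = 16383   # assumed 15-bit quantized predictor coefficient
    # Both operands fit in 24 bits and the product fits in 32, so the result is exact.
    print(mul24(sample, coefficient) == sample * coefficient)
    # An operand wider than 24 bits silently loses its high bits.
    print(mul24(1 << 24, 3))
```

So a single sample-times-coefficient product is safe for 16-bit audio under these assumptions. Wider samples, or sums of many such products, would need separate care.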
|
|
|
Sep 12 2009, 22:16
Post
#37
|
|
![]() Group: Members Posts: 9 Joined: 22-October 07 Member No.: 48099 |
Just finished my tests:
image.wav:
flac 1.2.1 -8 : 52 sec
flac-cuda -8 : 32 sec

image.wav divided in 10 songs (1.wav, 2.wav etc.)
flac 1.2.1 -8 : 52 sec
flac-cuda -8 : 32 sec
flac 1.2.1-icl : 30 sec

flac 1.2.1-icl is operating on both cores on my processor.
Intel Core 2 Duo E8500; Nvidia 8800 GT
flac 1.2.1-icl I found sometime ago somewhere in hydrogenaudio
|
|
|
Sep 12 2009, 22:51
Post
#38
|
|
![]() Group: Members Posts: 840 Joined: 7-October 01 Member No.: 235 |
QUOTE: Just finished my tests: image.wav: flac 1.2.1 -8 : 52 sec, flac-cuda -8 : 32 sec. image.wav divided in 10 songs (1.wav, 2.wav etc.): flac 1.2.1 -8 : 52 sec, flac-cuda -8 : 32 sec, flac 1.2.1-icl : 30 sec. flac 1.2.1-icl is operating on both cores on my processor. Intel Core 2 Duo E8500; Nvidia 8800 GT. flac 1.2.1-icl I found sometime ago somewhere in hydrogenaudio
Afaik there isn't a good Multi-Core version and I can't believe a different compile can speed up by 75%. Please upload this version somewhere or link to its source.
|
|
|
Sep 12 2009, 23:40
Post
#39
|
|
![]() Group: Members Posts: 9 Joined: 22-October 07 Member No.: 48099 |
QUOTE: Afaik there isn't a good Multi-Core version and I can't believe a different compile can speed up by 75%. Please upload this version somewhere or link to its source.
I think those different flac encoders I have came from rarewares
|
|
|
Sep 12 2009, 23:53
Post
#40
|
|
![]() Group: Members Posts: 840 Joined: 7-October 01 Member No.: 235 |
QUOTE: Afaik there isn't a good Multi-Core version and I can't believe a different compile can speed up by 75%. Please upload this version somewhere or link to its source.
QUOTE: I think those different flac encoders I have came from rarewares
It can't be that compile, and please don't waste my time with trying some versions you link to because you "think" it may be the one.
|
|
|
Sep 13 2009, 10:00
Post
#41
|
||
|
Group: Developer (Donating) Posts: 2040 Joined: 19-October 01 From: Finland Member No.: 322 |
I made a more thorough comparison with the new version. I combined a wav from 18 different genres giving hopefully a better representation of real abilities. This compares each compression mode. Horizontal scale is compression ratio and vertical scale is encoding speed vs realtime. With this test set CUDA version was more efficient starting from compression mode 6 but then only faster than FLAC's modes 7 and 8.
|
|
|
|
||
Sep 13 2009, 14:36
Post
#42
|
|
![]() Group: Members Posts: 648 Joined: 10-January 06 From: Zagreb Member No.: 27018 |
It sure isn't that compile, as they (at least for me) run at the same speed for -8.
|
|
|
|
Sep 13 2009, 16:19
Post
#43
|
|
![]() Group: Members Posts: 128 Joined: 9-August 06 Member No.: 33830 |
I've done a quick test of how a 2+ year old full-fledged mainstream CPU (to be more precise: one core of it) stands against a pretty cheap, slightly better than low-end GPU of its own era, both overclocked. The Core 2 Duo (E6420, Conroe core) runs at 3328MHz with ddr2-832 cl4; the 8600GT runs at 580/1296/837MHz, which is all it can do with passive cooling (probably at a decreased core voltage).

CPU: 49.8x (3328/416MHz)
GPU: 69.4x (580/1296/837MHz)
GPU: 66.4x (540/1188/702MHz)
lv6: GPU: 54.3x (580/1296/837MHz)

I've tested both -5 and -6 because for my test material the file size with FLAC 1.2.1 -8 fell right between FLACuda -5 and -6.

Decoding speed (performed by fb2k):
1.1.2 -8: 615x
CUDA -5: 618x
CUDA -6: 572x
(FLACUDA -11 encoded much slower, ~12x; and it also decoded slower, ~300x)

Considering how insanely powerful (and extremely power-hungry) GPUs are these days, a GPU FLAC encoder seems a good idea.
I just found one glitch: the decoded audio data seems identical but the FLAC/Cuda files are not seekable in my fb2k 0.9.6.9. The parameters were -6 - -o %d (OK, I see, I'm not alone with this problem)
[p.s. I also made a comparison with TAK -p2m, which I regularly use: 77.7x encoding by one CPU core, 3.5% smaller (968 vs 1002kbps) and decodes at 384x speed - definitely slower than FLAC, except extreme FLACuda files]
|
|
|
Sep 13 2009, 16:47
Post
#44
|
|
![]() Group: Developer Posts: 648 Joined: 2-October 08 From: Ottawa Member No.: 59035 |
Thank you for the detailed test results. Looking at them I decided to focus on optimizing performance at lower compression levels. Version 03 should be noticeably faster at levels 0..7. I also fixed the problem with files being unseekable when using pipe encoding from fb2k.
This post has been edited by Gregory S. Chudov: Sep 13 2009, 16:50
-------------------- CUETools 2.1.4
|
|
|
|
Sep 13 2009, 19:33
Post
#45
|
|
![]() Group: Members Posts: 128 Joined: 9-August 06 Member No.: 33830 |
QUOTE: Thank you for detailed test results. Looking at them i decided to focus on optimizing performance at lower compression levels. Version 03 must be noticeably faster at levels 0..7. I also fixed the problem with files being unseekable when using pipe encoding from fb2k.
The good #1: the resulting FLAC is seekable!
The good #2: -6 is definitely faster, 60.4x vs 54.3x
The bad: the files are slightly larger, now I need -7 to get a smaller result than Flac 1.1.2 -8 (CPU -8: 37810k; CUDA -6: 37857k; CUDA -7: 37791k)
The ugly: FLACuda -7 is slower than CPU FLAC -8. On my 'nose heavy' system, that is.
Hm, I probably should try with different tracks (my ad-hoc test sample is a ZUN theme from the Changeability of Strange Dream album; strictly speaking it's not a Touhou soundtrack, but similar to the game background music).
Is it the seek table that makes -6 files larger?
update: In the case of a Rammstein track, GPU -6 got smaller than CPU -8. Need more samples to test. (Sorry, I was a bit hasty to post about it. Human error.)
This post has been edited by alvaro84: Sep 13 2009, 19:38
|
|
|
Sep 13 2009, 19:42
Post
#46
|
|
![]() Group: Developer Posts: 648 Joined: 2-October 08 From: Ottawa Member No.: 59035 |
QUOTE: Is it the seek table that makes -6 files larger?
Nope. Old-style -6 can be invoked by parameters "-5 -l 12". That's a lower-case L there, not a digit 1.
This post has been edited by Gregory S. Chudov: Sep 13 2009, 19:43
-------------------- CUETools 2.1.4
|
|
|
|
Sep 13 2009, 21:39
Post
#47
|
|
|
Group: Developer (Donating) Posts: 2040 Joined: 19-October 01 From: Finland Member No.: 322 |
|
|
|
|
Sep 14 2009, 20:25
Post
#48
|
|
![]() Group: Developer Posts: 648 Joined: 2-October 08 From: Ottawa Member No.: 59035 |
Phew. I think i finally squeezed everything i could out of it, at least for now.
Version 04 should be faster than anything.
-------------------- CUETools 2.1.4
|
|
|
|
Sep 14 2009, 21:18
Post
#49
|
|
|
Group: Developer (Donating) Posts: 2040 Joined: 19-October 01 From: Finland Member No.: 322 |
|
|
|
|
Sep 14 2009, 21:26
Post
#50
|
|
![]() Group: Developer Posts: 648 Joined: 2-October 08 From: Ottawa Member No.: 59035 |
Thank you.
-------------------- CUETools 2.1.4
|
|
|
|