FLACCL: CUDA-enabled FLAC encoder by Gregory S. Chudov (prev. FlaCuda), Formerly "lossless codecs and CUDA"
alvaro84
post Sep 15 2009, 04:57
Post #51

QUOTE (Case @ Sep 14 2009, 21:18) *
Impressive.


Confirmed.

CODE
             CPU -8   GPUv3 lv6   GPUv4 lv6   GPUv3 lv7   GPUv4 lv7   GPUv4 lv11
ZUN           49.8x       60.4x       71.4x       46.8x       52.0x       13.57x
Rammstein     49.5x       63.4x       74.7x       48.9x       54.3x       13.59x


And the file sizes are down again. Decode times don't seem to suffer either, compared with CPU-flac-encoded files of similar size (592x vs 600x, and the file is 0.3% smaller).
Keep up the good work!
sauvage78
post Sep 15 2009, 05:24
Post #52

So far I wasn't much interested in the small compression gain, but the v0.4 speed gain is impressive.
Anyway, I won't buy an Nvidia card just for CUDA... sadly I own a GeForce 7600GS, and my next graphics card will be integrated into my future Core i3 530... so I guess I will never use FlaCuda ;(

It makes me wonder how fast a multi-threaded FlaCuda -4 encode could run on a Sandy Bridge octo-core with a GeForce 300 ;) ... more than 16x faster than my old Athlon XP 3000+ (Barton), I guess ;)

The sad thing for FlaCuda is that in the near future cheap GPUs will be integrated into low-end CPUs as soon as 2010 (Clarkdale, 2 cores + 45nm GPU) and into mid-range CPUs as soon as 2011 (Sandy Bridge, 4 cores + 32nm GPU) for Intel, with AMD following (one year late, as always). All these integrated GPUs will have hardware acceleration for Blu-ray video codecs, so unless you're a die-hard gamer, buying an Nvidia card will be a pure waste of money.

The coming years will be hard for Nvidia. I'm not even sure it will survive.

This post has been edited by sauvage78: Sep 15 2009, 06:03


hlloyge
post Sep 15 2009, 06:50
Post #53

I wouldn't kill off nVidia just yet. AFAIK its cards are currently the only ones that support GPU video transcoding, which is heavily used in newer encoding applications, as well as in the new Photoshop for calculating some effects.
While we're on the subject: where is this multithreaded flac encoder, as a BINARY, so I can test it?
sauvage78
post Sep 15 2009, 07:05
Post #54

IMHO audio or video encoding will not help Nvidia survive long, because if the only purpose of buying a GPU becomes accelerating encoding, then you'd better buy a higher-end CPU: being built on a smaller process than GPUs, CPUs will always have the advantage in brute encoding force versus power consumption and heat.

As for a multithreaded flac encoder, AFAIK there is none. I think I recall reading about some very experimental proof-of-concept code on a mailing list, but nothing serious.

Maybe we should start a donation drive to buy a quad core for Josh; it can't be more useless than buying a PC for Klemm, after all ;)


GeSomeone
post Sep 15 2009, 12:43
Post #55

QUOTE (sauvage78 @ Sep 15 2009, 08:05) *
As for a multithreaded flac encoder, AFAIK there is none, ..

The simplest way to use multithreading with any encoder is to run multiple encoder instances simultaneously (foobar2000 can do that). The number of usable threads depends on when the hard disk becomes the bottleneck.
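
To make that concrete, here is a minimal sketch of the idea - one single-threaded encoder process per input file, at most one per core. POSIX fork/exec, the "flac -8" command line, and NCORES are all assumptions for illustration, not any frontend's actual code:

CODE
#include <unistd.h>
#include <sys/wait.h>

#define NCORES 4                            /* assumed number of CPU cores */

int main(int argc, char **argv)
{
    int running = 0;
    for (int i = 1; i < argc; i++) {        /* each argv[i] is one input .wav */
        if (running == NCORES) {            /* all slots busy: wait for one */
            wait(NULL);
            running--;
        }
        pid_t pid = fork();
        if (pid == 0) {                     /* child: become one encoder */
            execlp("flac", "flac", "-8", argv[i], (char *)NULL);
            _exit(127);                     /* only reached if exec failed */
        }
        if (pid > 0)
            running++;
    }
    while (wait(NULL) > 0)                  /* reap the remaining children */
        ;
    return 0;
}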
Case
post Sep 15 2009, 20:34
Post #56

Just converted my entire FLAC -8 library to FlaCuda -8 and it came out 0.157% smaller. That's a slightly smaller difference than my sample file suggested (0.166%).
Wombat
post Sep 16 2009, 08:45
Post #57

Seems like I found some strange behaviour. If you have a 16-bit file that doesn't use all of its bits, it gets much larger than with flac or even your CUETools.Flake.exe encoder.

For the example I took a 16-bit file, converted it to 8-bit and back to 16-bit, so 8 bits are unused.
All at -8:

flac 1.2.1     8,561,886 bytes
your Flake     8,572,572 bytes
FlaCuda       45,509,060 bytes

This post has been edited by Wombat: Sep 16 2009, 08:46
odyssey
post Sep 16 2009, 09:11
Post #58

QUOTE (GeSomeone @ Sep 15 2009, 13:43) *
QUOTE (sauvage78 @ Sep 15 2009, 08:05) *
As for a multithreaded flac encoder, AFAIK there is none, ..

The simplest way to use multithreading with any encoder is to run multiple encoder instances simultaneously (foobar2000 can do that). The number of usable threads depends on when the hard disk becomes the bottleneck.

Now we just need a way to simultaneously run CUDA and CPU encoders ;)


[JAZ]
post Sep 16 2009, 18:36
Post #59

QUOTE (Wombat @ Sep 16 2009, 09:45) *
Seems like I found some strange behaviour. If you have a 16-bit file that doesn't use all of its bits, it gets much larger than with flac or even your CUETools.Flake.exe encoder.


Could you try encoding a lossyWAV-processed file to see if it shows the same behaviour? (In that case, wav -> flacuda would end up the same size as wav -> lossywav -> flacuda.)

If that is true, it would seem this FLAC implementation misses part of the specification (and maybe could reduce sizes even further).
lvqcl
post Sep 16 2009, 19:09
Post #60

FlaCuda_0.4 with "-8" switch, original test file: 975 kbps;

After LossyWAV --standard:

FlaCuda_0.4 -8: 996 kbps;
FlaCuda_0.4 -8 -b 512: 1011 kbps.

Flake_0.11 -8: 1000 kbps (Flake encoder from Winamp Essentials Pack 5.55).

Flac_1.2.1 -5 -b 512: 462 kbps.

This post has been edited by lvqcl: Sep 16 2009, 19:10
Wombat
post Sep 16 2009, 19:14
Post #61

QUOTE (lvqcl @ Sep 16 2009, 20:09) *
FlaCuda_0.4 with "-8" switch, original test file: 975 kbps;

After LossyWAV --standard:

FlaCuda_0.4 -8: 996 kbps;
FlaCuda_0.4 -8 -b 512: 1011 kbps.

Flake_0.11 -8: 1000 kbps (Flake encoder from Winamp Essentials Pack 5.55).

Flac_1.2.1 -5 -b 512: 462 kbps.


I can second that behaviour. Notably, Mr. Chudov's CUETools.Flake.exe compresses about as well as flac 1.2.1 on lossyWAV files, too.
So hopefully only a small fix is needed.
alvaro84
post Sep 17 2009, 08:51
Post #62

QUOTE (odyssey @ Sep 16 2009, 09:11) *
Now we just need a way to simultaneously run CUDA and CPU encoders ;)


Sounds fun, though I'm afraid we'd hit a hard bottleneck in disk head positioning :( Even converting with 2 threads makes a single HDD seek like crazy - but it's still a lot faster than 1 thread.
NCQ in AHCI mode should help a lot with more threads, but it didn't when I tested it a while ago. Physically separate source/target drives can alleviate this bottleneck quite a bit.
Fast SSDs are worth a try too.
This CUDA encoder can be a different solution: with one instance it's faster than the reference encoder running on one core of my CPU (and converting one file at a time is the least disk-bottlenecked way to do it).
A natively multithreaded CPU-based encoder (working on segments of one single track) is another option.
Gregory S. Chudov
post Sep 17 2009, 23:12
Post #63

Added lossyWAV support. It shouldn't make any difference for normal WAVs.
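
For anyone wondering what "lossyWAV support" most likely boils down to: detecting bits that are zero in every sample of a block (FLAC's "wasted bits") and shifting them out before prediction, which would also cover the 8-bit-in-16-bit case above. A minimal sketch of such a detector (a hypothetical helper, not FlaCuda's actual code):

CODE
#include <stdint.h>

/* Count low bits that are zero in every sample of the block; those bits can
 * be shifted out and the shift stored once per subframe ("wasted bits"). */
static int wasted_bits(const int32_t *x, int n)
{
    uint32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc |= (uint32_t)x[i];              /* OR together all samples */
    if (acc == 0)
        return 0;                           /* all-zero block: nothing to do */
    int w = 0;
    while (!(acc & 1)) {                    /* common trailing zero bits */
        acc >>= 1;
        w++;
    }
    return w;                               /* encode (x[i] >> w) instead */
}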


Wombat
post Sep 17 2009, 23:30
Post #64

Thanks for your fast work on that!
It now works flawlessly on the 8-bit and lossyWAV files.
Using it on normal files at -8 even comes out a few bytes smaller than 0.4 :)

Edit: For those who use lossyWAV: standard flac still seems to compress such files a bit better, but not by much.

This post has been edited by Wombat: Sep 17 2009, 23:43
Gregory S. Chudov
post Sep 18 2009, 19:05
Post #65

Is there anybody here who knows the math behind the Cholesky decomposition used in ffmpeg as an alternative method of LPC coefficient search?
This method is too slow for the CPU, but I thought I'd give it a shot on the GPU.
The problem is, the GPU doesn't do double precision very well.
The lls code from ffmpeg doesn't work in single-precision floats due to overflows.
My first idea was to scale down the signal to avoid the overflows, but the results were poor.
There's something I don't understand about this algorithm: in theory, LPC coeffs shouldn't depend on the scale of the signal - after all, they are linear :)
I suspect that in practice this algorithm depends on the scale of the signal a lot. I don't pretend to understand this math, but:
The first suspicious piece of code is this (from av_solve_lls):

CODE
            double sum= covar[i][j];
            for(k=i-1; k>=0; k--)
                sum -= factor[i][k]*factor[j][k];


When the signal is multiplied by 10, covar[i][j] is multiplied by 100, and both factor[i][k] and factor[j][k] are multiplied by 100, so factor[i][k]*factor[j][k] is multiplied by 10000. So this sum doesn't scale in any predictable fashion.

I also don't understand this magic 'threshold' business.

CODE
                if(sum < threshold)
                    sum= 1.0;


How should the threshold scale with the signal? Should the sum always be set to 1.0 when it's below the threshold, or to some value that depends on the scale of the signal? Or am I on the wrong track completely?
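
For comparison, here is how the textbook version would scale, assuming factor is the plain Cholesky factor of covar (av_solve_lls takes sqrt(sum) on the diagonal, so outside the threshold clamp this should apply):

CODE
If the signal is scaled x -> s*x:
    covar = X^T X              ->  s^2 * covar
    covar = L L^T  implies  L  ->  s * L
    factor[i][k]*factor[j][k]  ->  s^2 * factor[i][k]*factor[j][k]

If that holds, the whole running sum scales uniformly by s^2, and the scale-sensitive part is the fixed threshold: scaling the signal down pushes the diagonal sums below it, and the sum = 1.0 clamp then no longer matches the scale of the rest of the matrix - which would explain why scaling down gave poor results.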

I also found this old post from Josh:
QUOTE (jcoalson @ Jul 24 2006, 10:04) *
I have actually been doing experiments solving the full prediction linear system with SVD; this should give a lower bound on the compression achievable by the FLAC filter.

Is there any working code left from those experiments, and how successful were they?

This post has been edited by Gregory S. Chudov: Sep 18 2009, 19:23


Gregory S. Chudov
post Sep 18 2009, 19:54
Post #66

I must add that when the computations are done in double precision, the lls coeffs don't depend much on the scale of the signal, so the algorithm works despite the non-linear scaling of intermediate values.
But in single precision they start to drift much more. Which is weird, because in the literature Cholesky decomposition is said to be more stable than Levinson-Durbin recursion with regard to rounding errors.
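
If it's the long inner-loop accumulations that lose the precision in float, one standard trick worth trying is Kahan-compensated summation. A sketch of the idea applied to that loop (a hypothetical helper, not tested in FlaCuda):

CODE
/* Computes start - sum(a[i]*b[i]) with a compensation term that carries the
 * low-order bits ordinary float addition would drop. */
static float kahan_dot_sub(float start, const float *a, const float *b, int n)
{
    float sum = start, c = 0.0f;
    for (int i = 0; i < n; i++) {
        float y = -a[i] * b[i] - c;         /* next term, minus lost bits */
        float t = sum + y;
        c = (t - sum) - y;                  /* recover what the add dropped */
        sum = t;
    }
    return sum;
}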

Here is a sample of this drift in double precision:
CODE
SCALE: 1.0/6 COEFF[31,0..2]: 0.523100 0.287037 0.204438; COVAR[31,0..2]: 43226.383239 170398.007602 -241511.245261
SCALE: 1.0/7 COEFF[31,0..2]: 0.523086 0.287057 0.204432; COVAR[31,0..2]: 37051.186185 146055.437263 -207009.641880


This post has been edited by Gregory S. Chudov: Sep 20 2009, 16:46


skamp
post Sep 19 2009, 09:38
Post #67

QUOTE (alvaro84 @ Sep 17 2009, 09:51) *
Sounds fun, though I'm afraid we'd hit a hard bottleneck in disk head positioning :( Even converting with 2 threads makes a single HDD seek like crazy - but it's still a lot faster than 1 thread. […] A natively multithreaded CPU-based encoder (working on segments of one single track) is another option.

Ideally you would run multiple instances of a single-threaded encoder (one track per CPU core) and one instance of the CUDA encoder per GPU at the same time - it's just a matter of making sure that all instances are kept busy.

When the number of remaining tracks drops below the number of available cores, you prioritize the GPU instance (since it's faster than a single-threaded encoder on a single CPU core), but you also run a multi-threaded encoder if one is available; one MT encoder over two cores is likely to be slower than two instances of an ST encoder over the same number of cores (see the Lancer builds of the Ogg Vorbis encoder). In other words, an MT encoder is particularly useful for keeping CPU cores busy when the workload dries up.

In short, the priorities go like this (if you have a multi-core CPU, that is):
ST * n CPU cores > GPU > MT
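
A rough sketch of that dispatch rule (the enum, the counts, and the thresholds are all made up for illustration):

CODE
typedef enum { LAUNCH_ST, LAUNCH_GPU, LAUNCH_MT, LAUNCH_NONE } launch_t;

/* Decide what to start next, given how many tracks remain in the queue. */
launch_t next_launch(int tracks_left, int idle_cores, int gpu_idle)
{
    if (tracks_left <= 0)
        return LAUNCH_NONE;        /* nothing left to hand out */
    if (gpu_idle)
        return LAUNCH_GPU;         /* the GPU runs alongside the CPU anyway */
    if (idle_cores == 0)
        return LAUNCH_NONE;        /* all cores busy: wait for one to finish */
    if (tracks_left >= idle_cores)
        return LAUNCH_ST;          /* plenty of tracks: one ST instance per core */
    return LAUNCH_MT;              /* workload drying up: MT soaks up spare cores */
}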

As for the I/O bottlenecks, that's when a large enough RAMdisk comes in very handy. Even just 1GiB is often enough for encoding a whole album (WAV + FLAC or FLAC + Ogg Vorbis or whatever on the RAMdisk).

I already use all available CPU cores when I encode my rips to FLAC or any other codec (one track per core); what I could really use, even before an MT FLAC encoder comes along, is a simple command-line multi-threaded ReplayGain utility. As I've said in the past, computing RG values for an album now takes longer than encoding it in the first place (the former uses only one core, while the latter uses all 4 cores of my quad-core CPU).


alvaro84
post Sep 19 2009, 13:04
Post #68

QUOTE (skamp @ Sep 19 2009, 09:38) *
As for the I/O bottlenecks, that's when a large enough RAMdisk comes in very handy. Even just 1GiB is often enough for encoding a whole album (WAV + FLAC or FLAC + Ogg Vorbis or whatever on the RAMdisk).


You're absolutely right; I don't know how I could forget about RAM disks. I used them all the time back when 8MiB felt like plenty of RAM, but somehow I haven't thought of them since we got multiple GiBs at our disposal... talk about contradictions...
gib
post Sep 19 2009, 22:39
Post #69

I've gotten FlaCuda to work with the old but still handy FLAC Frontend. The only little issue is that FlaCuda doesn't recognize the -V option as verify like flac.exe does, so I can't use the verify checkbox in the Frontend. It's a tiny thing, but it would be cool if -V were added to FlaCuda, maybe along with a future update. If not, I'll just set it up to work with foobar.

Thank you again, Gregory. Very cool stuff.
Wombat
post Sep 19 2009, 23:10
Post #70

QUOTE (gib @ Sep 19 2009, 23:39) *
I've gotten FlaCuda to work with the old but still handy FLAC Frontend. The only little issue is that FlaCuda doesn't recognize the -V option as verify like flac.exe does, so I can't use the verify checkbox in the Frontend. It's a tiny thing, but it would be cool if -V were added to FlaCuda, maybe along with a future update. If not, I'll just set it up to work with foobar.

Thank you again, Gregory. Very cool stuff.

Since you can't use ReplayGain with FLAC Frontend and FlaCuda, and you still want its simple layout, just try Multi-Frontend from the same author. There you can define your command line with --verify.
I even resurrected Frontah for mirroring old files to a new folder with FlaCuda and tags in one click. Its ini is simple to adjust to make it work. Too sad that Frontah development was stopped.

Edit: For anyone recommending foobar now, please tell me how to simply mirror (re-encode) folders, copying tags and ReplayGain info, in one go. I didn't manage to do it that simply, but I read here and there "use foobar" with no detailed info on how. Maybe I misunderstand its functionality.

Edit2:
Finished re-encoding my collection. Since I used flac 1.1.0-1.2.1, Flake and some other builds over the years, I suppose my space savings are of no use as a guiding value.
On some albums there were big savings. A few albums came out bigger, mainly very quiet music or music with many quiet parts. I can imagine that on collections with certain kinds of music it won't save as much space as expected.

This post has been edited by Wombat: Sep 20 2009, 00:09
gib
post Sep 20 2009, 00:41
Post #71

Wombat, thanks for the suggestion of using Multi-Frontend. I can't believe I hadn't downloaded it before. Thanks again!
Gregory S. Chudov
post Sep 25 2009, 23:45
Post #72

Here is a version that should be a tiny bit faster. Since the HDD was the bottleneck for the previous version, I was only able to measure the speed improvement when using a RAM disk.

I'm still curious about alternatives to the Levinson-Durbin algorithm (I commented above on my problems with ffmpeg's least-squares model). Any help would be appreciated.

This post has been edited by Gregory S. Chudov: Sep 25 2009, 23:47


Wombat
post Sep 26 2009, 01:58
Post #73

I really can't tell if your FlaCuda became any faster, because it was damn fast before. All I can say is that it's kind of fun having the GPU do its job without your system feeling under heavy stress, so while encoding with FlaCuda you can still run heavy tasks in the foreground. I love it.
Case
post Sep 26 2009, 11:19
Post #74

This is getting ridiculous. New FlaCuda 0.6 is faster even in mode -8 than 0.4 was in -0.
Attached Image
alvaro84
post Sep 26 2009, 12:54
Post #75

I second that, it's ridiculous :D
Now FlaCuda 0.6 at -6 is almost as fast as flac 1.2.1 -8 running on two threads... and this is a stock 8600GT standing up against a heavily overclocked Core 2 Duo. If I give the GeForce a little overclock, it comes out faster than the two instances of flac 1.2.1 together... the file sizes are even a bit smaller than with the CPU encoder, and there are 'more hardcore' settings... it's true that heavier compression takes a toll on decoding speed too, so I stick with the original -8-ish compression when I use FLAC.
TAK is somewhat slower to decode, but it compresses better than even FlaCuda does at -11, and that's beyond the speed crossover point: that -11 FLAC is slower to decode than the -p2m TAK (which is 18 kbps smaller on my test material).
No, this isn't a TAK marketing remark; I'm just testing that too - it's interesting for me to compare these codecs.

This post has been edited by alvaro84: Sep 26 2009, 12:55
