Full Version: FLAC, SSE3 support?
gottkaiser
Is there an SSE3 compile of FLAC v1.1.3, like the Ogg Vorbis compile from Lancer (http://homepage3.nifty.com/blacksword/)?
Or is it even possible to create one?

Thanks for any replies.
Egor
There is an optimized ICC 9.1 compile, which is significantly faster on the latest Intel processors.
Download here, see the discussion: P4 Win32 Compile of Flac 1.1.3.
gottkaiser
QUOTE(Egor @ Jan 14 2007, 14:46) *

There is an optimized ICC 9.1 compile, which is significantly faster on the latest Intel processors.
Download here, see the discussion: P4 Win32 Compile of Flac 1.1.3.


But it looks like it doesn't work for AMD CPUs.
gharris999
No, that one ought to work with AMD cpus. Try it and see.
philaphonic
This compile kicks ass. Works great. What a speed increase on my Core 2 Duo E6600!

Good work!
gottkaiser
It is not working on an AMD system.
Is it not possible to do an extra compile for an AMD CPU? (please) :-)
HbG
If you want a faster flac, try flake (discussion thread)
gottkaiser
Not flake! I want the original flac. Just an SSE3 optimised compile.
Synthetic Soul
Works fine on my Athlon at home (CPUZ report), and here at work (CPUZ report).

One curiosity though:

CODE
flac+ -o plus.flac f.wav && flac -o flac.flac f.wav

flac 1.1.3, Copyright (C) 2000,2001,2002,2003,2004,2005,2006  Josh Coalson
flac comes with ABSOLUTELY NO WARRANTY.  This is free software, and you are
welcome to redistribute it under certain conditions.  Type `flac' for details.

f.wav: wrote 199097 bytes, ratio=0.469

flac 1.1.3, Copyright (C) 2000,2001,2002,2003,2004,2005,2006  Josh Coalson
flac comes with ABSOLUTELY NO WARRANTY.  This is free software, and you are
welcome to redistribute it under certain conditions.  Type `flac' for details.

f.wav: wrote 199233 bytes, ratio=0.469

NB: foobar reports the files as bit-identical. I guess this comes down to floating point calculation... or something.
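One way to check the "bit-identical" claim outside foobar is to decode both FLAC files back to WAV (for example with `flac -d`) and compare the raw PCM frames. Below is a minimal Python sketch, assuming the decoded files are plain PCM WAVs that the standard `wave` module can read. The helper names are made up for this example.

```python
import hashlib
import wave


def pcm_digest(path):
    """Return (format parameters, SHA-256 of the raw PCM frames) for a WAV file."""
    with wave.open(path, "rb") as wav:
        fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
        digest = hashlib.sha256()
        while True:
            frames = wav.readframes(65536)  # read in chunks to keep memory use low
            if not frames:
                break
            digest.update(frames)
    return fmt, digest.hexdigest()


def same_audio(path_a, path_b):
    """True when both files have the same format and identical PCM data."""
    return pcm_digest(path_a) == pcm_digest(path_b)
```

With hypothetical decoded files `plus.wav` and `flac.wav`, `same_audio("plus.wav", "flac.wav")` returning True would mean the two encodes differ only in how they were compressed, not in the audio they hold.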
yong
AFAIK flac is only optimized with SSE (maybe MMX and 3DNow! too).
gottkaiser
QUOTE(Synthetic Soul @ Jan 15 2007, 14:45) *
Works fine on my Athlon at home (CPUZ report), and here at work (CPUZ report).


What works fine? The P4 Win32 Compile of Flac 1.1.3?
With my CPU it's not working.


System: AMD Sempron64 2800+ (Palermo), WinXP

Any suggestions?
Synthetic Soul
QUOTE(gottkaiser @ Jan 15 2007, 13:57) *
What works fine? The P4 Win32 Compile of Flac 1.1.3?
Yes, gharris999's compile, found here (link taken from his post here).
Martin H
When msvcr80.dll is placed in the app directory, it needs a manifest file in the same directory to be able to run (only the VC8 runtime needs this), unless the VC8 runtimes have already been installed, of course.

VC8 runtimes install :
http://www.microsoft.com/downloads/details...;displaylang=en

The compile does work on AMDs also, as it's compiled with the ICL flag /QaxN.

@gharris999

I'm no expert, but wouldn't it be better to use the /QaxW flag instead of /QaxN? From the docs, it seems to me that /QaxN generates optimized code only for Intel SSE2 CPUs, plus generic, non-processor-specific code for non-SSE2 Intel CPUs and for all AMD CPUs, with or without SSE2. /QaxW, on the other hand, generates optimized code for both Intel and AMD SSE2 CPUs (e.g. the Athlon 64) and reverts to generic code, using neither SSE nor SSE2, for everything else (apart from FLAC's own hand-written SSE assembly optimizations, which apply to all SSE-capable CPUs).


CU, Martin.
gottkaiser
It's working now.
Martin H
@gharris999

From the "Quick-Reference Guide to Optimization with Intel® Compilers" :
http://cache-www.intel.com/cd/00/00/22/23/222300_222300.pdf

/Qax

Automatic Processor Dispatch. Generates specialized code and enables
vectorization for the indicated processors while also generating non-processor-
specific code.

W – Generate SSE2 and SSE instructions and optimize for the Intel Pentium4
processor, Intel Xeon processor with SSE2, and other compatible processors that
include SSE2 and SSE such as AMD* processors.

N – Generate SSE2 and SSE instructions for Intel processors and optimize for
the Intel Pentium4 and Intel Xeon processor with SSE2. The processor-specific
code path may contain features that are not supported on other processors. The
generic code path does not contain such features.


PS. Thank you very much for making these compiles, it's much appreciated :-)


CU, Martin.
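To make the difference between the two flags concrete, here is a toy Python model of the dispatch behaviour as I read the guide excerpt above. It only illustrates the quoted text and is not how the Intel compiler actually implements dispatch. The function name and the vendor labels are hypothetical.

```python
def code_path(flag, vendor, has_sse2):
    """Toy model of which path a /Qax binary would take, per the quoted guide.

    flag:     "W" or "N" (the letter after /Qax)
    vendor:   "intel" or "amd" (hypothetical labels for this sketch)
    has_sse2: whether the CPU supports SSE2
    """
    if flag not in ("W", "N"):
        raise ValueError("this sketch only models /QaxW and /QaxN")
    if not has_sse2:
        # No SSE2: always the non-processor-specific path.
        return "generic"
    if flag == "W":
        # W: Intel and other compatible SSE2 processors, such as AMD.
        return "sse2"
    # N: the guide only promises the specialized path on Intel processors.
    return "sse2" if vendor == "intel" else "generic"
```

Under this reading, `code_path("N", "amd", True)` gives "generic" while `code_path("W", "amd", True)` gives "sse2", which is the whole argument for trying /QaxW.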
gharris999
I think I tried the /QaxW option at least once but found that the flac binary ran slower on my system. I'm sorry now that I didn't save that binary (at least I don't think I saved it, but I'll check tomorrow.) Anyway, my "trialware" period for the intel compiler expired last week so I can't do any more compiles for the time being. Since I don't even own VS2005 and the intel compiler didn't seem to want to play with my version of VS6, I'm not feeling really anxious to plunk down the $$$ to buy IC9.

Mangix
Visual C++ is free to download...

It doesn't include SSE3 support, though. Actually, does FLAC even have any SSE3 functions in it?
jcoalson
there is no hand-written assembly using sse3 instructions but the compiler may be able to use them on some of the C code.

Josh
gharris999
In the Adriatic sea, in Croatia, on the island of Vis, at the summit of Mount Hum, stands the humble stone chapel dedicated to Saint Duh, who I have to believe is the patron saint of the blatantly obvious.

So, after resetting my system clock back a week, I recompiled flac 1.1.3 using the /QaxW option recommended by Martin above. You can get the binaries to this latest compile here:

http://www.hegardtfoundation.org/flacstuff...1.3-IC9sseW.zip

I'd appreciate it if someone with an AMD cpu would kick the tires on that and let me know how it performs in comparison to the stock flac release.

I've posted the other versions of my compiles there too:

http://www.hegardtfoundation.org/flacstuff....1.3-IC9sse.zip

http://www.hegardtfoundation.org/flacstuff....1.3-IC9alt.zip

http://www.hegardtfoundation.org/flacstuff...c-1.1.3-IC9.zip

I "think" that you all will find that the flac-1.1.3-IC9sseW.zip version will be the preferred one.
gharris999
QUOTE(jcoalson @ Jan 16 2007, 09:04) *

there is no hand-written assembly using sse3 instructions but the compiler may be able to use them on some of the C code.

Josh


I tried the "Intel Pentium® 4 Processor with Streaming SIMD Extensions 3 (SSE3) (/QaxP)" optimization back in December but it didn't seem to help encoding time on any of my CPUs. But, here are two compiles that include that optimization:

http://www.hegardtfoundation.org/flacstuff...1.3-IC9sse3.zip

http://www.hegardtfoundation.org/flacstuff...-IC9sse3sse.zip
beej
QUOTE(gottkaiser @ Jan 14 2007, 16:09) *

But it looks like it doesn't work for AMD CPUs.

Perhaps Intel is up to their usual dirty tricks :-)
Naughty Intel
gottkaiser
QUOTE(beej @ Jan 16 2007, 20:20) *

Perhaps Intel is up to their usual dirty tricks :-)
Naughty Intel



You should read the thread! It was working later! :-)
dyneq
Here's my output from the testflac.cmd included with the latest compile:

CODE
Flac and MetaFlac 1.1.3 stock:     0.031
Flac and MetaFlac 1.1.3 IC9sse:     0.031
Flac and MetaFlac 1.1.3 IC9sseW:    0.015


I have a P4 530 3.00GHz with HT, MMX and Streaming SIMD extensions up to version 3.

I just ran it as is. Did I do it correctly?

John
beej
QUOTE(gottkaiser @ Jan 16 2007, 21:29) *

You should read the thread! It was working later! :-)

I did read the thread. The question is, does the AMD cpu follow the same code path
as the Intel cpu or is it using the slower generic one?
gharris999
QUOTE(dyneq @ Jan 16 2007, 12:53) *

Here's my output from the testflac.cmd included with the latest compile:

CODE
Flac and MetaFlac 1.1.3 stock:     0.031
Flac and MetaFlac 1.1.3 IC9sse:     0.031
Flac and MetaFlac 1.1.3 IC9sseW:    0.015


I have a P4 530 3.00GHz with HT, MMX and Streaming SIMD extensions up to version 3.

I just ran it as is. Did I do it correctly?

John

In order for the test script to run with any meaning, you need to do the following:

Download the stock flac binaries from SourceForge and copy flac.exe and metaflac.exe to flac_113.exe and metaflac_113.exe in the testscript folder.

Then download the sse compile mentioned above and copy flac.exe and metaflac.exe to flac_113_IC9sse.exe and metaflac_113_IC9sse.exe in the testscript folder. Also copy libmmd.dll and msvcr80.dll to the testscript folder.

Then then download the sseW compile mentioned above and copy flac.exe and metaflac.exe to flac_113_IC9sseW.exe and metaflac_113_IC9sseW.exe in the testscript folder.

Then then then, rip a CD using EAC (or another ripping tool) to an image file + cuesheet and name those files blip.wav and blip.cue and put them into the testscript folder.

Finally, find a jpg file which could be credible cover art and copy that to blip.jpg in the testscript folder.

Then run the test script. You ought to get meaningful results then.
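For anyone without the test script, the same kind of comparison can be sketched in Python. This is not the testflac.cmd from the zip, only a rough stand-in that times a plain encode. It assumes the renamed binaries described above sit in the current folder next to blip.wav, and the output file names are made up for the example.

```python
import subprocess
import time


def time_command(args):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(
        args,
        check=True,  # raise if the command exits with an error
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def compare_encoders(encoders, wav="blip.wav"):
    """Time one encode per binary; returns {binary name: seconds}."""
    results = {}
    for exe in encoders:
        out = exe.replace(".exe", "") + ".flac"  # hypothetical output name
        # -f overwrites an existing output file, -o names the output file
        results[exe] = time_command([exe, "-f", "-o", out, wav])
    return results
```

Calling `compare_encoders(["flac_113.exe", "flac_113_IC9sse.exe", "flac_113_IC9sseW.exe"])` would then give one timing per binary. A single run each is noisy, so repeating the runs and keeping the best time per binary is probably wise.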
gharris999
QUOTE(beej @ Jan 16 2007, 13:09) *

I did read the thread. The question is, does the AMD cpu follow the same code path
as the Intel cpu or is it using the slower generic one?


Absent someone with an AMD cpu willing to replicate these compiles in a debug version and then trace through the execution in a debugger, we'll never know.

I suggest that you try some of these compiles on your AMD machine with a test script that can record the encoding times vs. the stock flac binaries. If you can get them to work, and if one is noticeably faster than the stock version, and if the resulting flac decodes to a wav that is identical to the original source wav, then...fine. Use said compile. If not, then don't.

But either way, post your results please.

Martin H
Intel Celeron 1.7GHz, which is identical to a P4 Willamette, except with only half the L2 cache.

Image.wav : 456MB

flac_113_stock.exe : 131.840 sec.

flac_113_IC9sse.exe : 124.289 sec.

flac_113_IC9sseW.exe : 121.455 sec.


Measured with Timer.exe (Global time).


I know that gharris999 asked for AMD tests, but I just posted to show that the IC9sseW compile is faster on my Intel CPU.
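Put as relative figures, those times work out as follows. A small Python sketch using only the three numbers posted above:

```python
# Encode times in seconds, as posted above (456 MB image, Celeron 1.7GHz).
times = {
    "flac_113_stock.exe": 131.840,
    "flac_113_IC9sse.exe": 124.289,
    "flac_113_IC9sseW.exe": 121.455,
}


def percent_faster(baseline, candidate):
    """Percentage of encode time saved relative to the baseline."""
    return (baseline - candidate) / baseline * 100.0


stock = times["flac_113_stock.exe"]
for name, seconds in times.items():
    print(f"{name}: {percent_faster(stock, seconds):.1f}% less time than stock")
```

That is a saving of a little under 6% for IC9sse and a little under 8% for IC9sseW on this CPU.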
Martin H
QUOTE(beej @ Jan 16 2007, 13:09) *

I did read the thread. The question is, does the AMD cpu follow the same code path
as the Intel cpu or is it using the slower generic one?

According to the icl docs, the flac_113_IC9sseW.exe compile will use SSE2 code on AMD and Intel CPUs that support SSE2, and generic code on the rest. The flac_113_IC9sse.exe compile will use SSE2 code on Intel CPUs that support SSE2, and generic code on the rest.

CU, Martin.
gharris999
QUOTE(Martin H @ Jan 16 2007, 14:17) *

QUOTE(beej @ Jan 16 2007, 13:09) *

I did read the thread. The question is, does the AMD cpu follow the same code path
as the Intel cpu or is it using the slower generic one?

According to the icl docs, the flac_113_IC9sseW.exe compile will use SSE2 code on AMD and Intel CPUs that support SSE2, and generic code on the rest. The flac_113_IC9sse.exe compile will use SSE2 code on Intel CPUs that support SSE2, and generic code on the rest.

CU, Martin.

Yes, and both those compiles use the FLAC__SSE_OS define that invokes the SSE2 C & assembler code that Josh includes in the source files but isn't turned on in the stock binaries.

Martin: thank you for suggesting QaxW optimization.
Martin H
QUOTE(gharris999 @ Jan 16 2007, 22:33) *

Yes, and both those compiles use the FLAC__SSE_OS define that invokes the SSE2 C & assembler code that Josh includes in the source files but isn't turned on in the stock binaries.

Completely forgot about that :-)
QUOTE

Martin: thank you for suggesting QaxW optimization.

You are very welcome, mate

And again, many thanks for making all these compiles available to us, it's much appreciated :-)


CU, Martin.
HbG
Is there any chance of these improvements becoming available for processors that only support SSE?
gib
I modified the test script quite a bit (removing all the metaflac stuff, the tagging, replaygain, etc.) so that it was just a straight wav to flac encode and nothing more. This is on an Athlon64 3400+ running Windows2000.

Flac 1.1.3 stock: 79.640
Flac 1.1.3 IC9sseW: 81.141

As you can see, despite the compiler optimizations, the IC9sseW version actually runs a little slower on my computer. Unfortunately it seems that optimizing flac for AMD processors, the Athlon64 at least, requires a bit more effort than setting some compiler flags in IC9.

Edit: Added a bit of relevant information.
gharris999
Well, that's disappointing. Anyone else with an AMD cpu bother to test this?
Synthetic Soul
QUOTE(gharris999 @ Jan 17 2007, 04:38) *
Well, that's disappointing. Anyone else with an AMD cpu bother to test this?
I would bother, if I had a clear idea of what I should be testing! There are so many versions released and so many TLAs being thrown around that I've just lost the will to live... er, plot.

Perhaps you could post a link to all current compiles and specify on what machines they are recommended?

In the meantime I guess I will take a look at IC9sseW.
Martin H
QUOTE(Synthetic Soul @ Jan 17 2007, 08:49) *

Perhaps you could post a link to all current compiles and specify on what machines they are recommended?

Hi Synthetic Soul :-)

The compiles to test are :

http://www.hegardtfoundation.org/flacstuff....1.3-IC9sse.zip
(According to the docs, SSE2 only for Intel CPUs that support it, and generic for the rest)

http://www.hegardtfoundation.org/flacstuff...1.3-IC9sseW.zip
(According to the docs, SSE2 for both Intel and AMD CPUs that support it, and generic for the rest)

To see which is fastest on AMD CPUs with SSE2, compare the times against the stock version.

CU, Martin.
Martin H
@gib

Could you please re-test with all background apps turned completely off, e.g. anti-virus etc.? The icl compiler switch that gharris999 used to compile the binary produces SSE2-optimized code that will be enabled on SSE2-capable AMD CPUs, so your results are very strange, to say the least.

CU, Martin.
Synthetic Soul
Thanks for the info Martin. :-)

As an AMD user on all three computers that I have access to (work, home pc and laptop) I guess I'll stick with testing IC9sseW.

However, there are other versions mentioned in posts 19 and 20, including IC9alt, IC9, IC9sse3, and IC9sse3sse. It appears, apart from possibly IC9alt, these bear no relevance to me, but if I were an Intel user I would be very confused...
Egor
QUOTE(Martin H @ Jan 17 2007, 17:40) *
Could you please re-test with all background apps turned completely off e.g. anti-virus etc.

By the way, I think Timer 3.01 should be used to measure process' execution time.
HbG
CPU: Athlon XP, SSE

Flac 1.1.3: 8.015s
IC9 sse: 8.671s
Flake 114: 3.968s

I'll stick to flake for the time being. :-)
Wombat
QUOTE(HbG @ Jan 17 2007, 14:42) *

CPU: Athlon XP, SSE

Flac 1.1.3: 8.015s
IC9 sse: 8.671s
Flake 114: 3.968s

I'll stick to flake for the time being. :-)

Try flake -10 -m 1 for really shocking results :-) (Guruboolez once played with this)
Martin H
@gharris999

I have just read another ICL manual that I found on the net, which was updated for ICL 9, and it said that the /QxN and /QaxN flags add some extra P4-only optimizations to the code which the /QxW and /QaxW flags don't include.

I perfectly understand if you don't want to make more compiles now, since you have already made so many, but I would really like to have a compile for P4 CPUs only, made with the /QxN flag and with Josh's hand-written SSE assembly enabled in the sources. Compiles that target multiple code paths, e.g. /QaxN, will be a little slower than straight /QxN compiles because of the runtime checks for which CPU they are running on. Hence, for P4 users (like yourself), a compile with the /QxN flag and Josh's SSE assembly enabled would be the absolute best compile possible :-) But I understand if you have had enough of compiling for now :-) Also, I personally would prefer to skip the libmmd.dll dependency (as it is described as a P4-only compile anyway), or else to link libmmd statically. If you would like that too, here are the instructions:

Skip the "libmmd" dependency :

icl -c -MD t.cpp

xilink /nodefaultlib:libmmd.lib t.obj ---- it will link msvcrt.lib instead.

Or

Link "libmmd" as static:

icl -MT t.cpp

It will link the libmmds.lib---the static version of libmmd.dll.


CU, Martin.
gharris999
QUOTE(Synthetic Soul @ Jan 17 2007, 04:53) *

However, there are other versions mentioned in posts 19 and 20, including IC9alt, IC9, IC9sse3, and IC9sse3sse. It appears, apart from possibly IC9alt, these bear no relevance to me, but if I were an Intel user I would be very confused...


If I even remotely knew what the hell I was doing, I would offer some guidance. But as it is, I'm just throwing semi-random permutations of optimization args at the compiler and hoping for the best. (Now that I think about it, is this really so different than my government's approach to achieving stability and efficiency in Iraq?)

Anyway, I was hoping that wiser heads than mine would try these compiles out, assess flaws, reveal advantages, and create a consensus as to which optimizations worked and which were useless. I always expected a semi-official optimized compile would then emerge from Josh's or John33's shop.

Edit: Martin: I'll try to get your suggestions compiled sometime this weekend. Thanx.
Ingemar
I got this link:

http://home.earthlink.net/~gharris999/flac-1.1.3-IC9sse.zip

But it's not working and I can't find another download location.

Is it down or has it been moved?

Thx.
Martin H
QUOTE(gharris999 @ Jan 17 2007, 17:06) *

Edit: Martin: I'll try to get your suggestions compiled sometime this weekend. Thanx.

Thanks mate - much appreciated :-)

CU, Martin.
gharris999
QUOTE(Ingemar @ Jan 17 2007, 09:22) *

Is it down or has it been moved?

Moved. Use http://www.hegardtfoundation.org/flacstuff....1.3-IC9sse.zip and http://www.hegardtfoundation.org/flacstuff...1.3-IC9sseW.zip instead.

Synthetic Soul
QUOTE(gharris999 @ Jan 17 2007, 16:06) *
Anyway, I was hoping that wiser heads than mine would try these compiles out, assess flaws, reveal advantages, and create a consensus as to which optimizations worked and which were useless. I always expected a semi-official optimized compile would then emerge from Josh's or John33's shop.
OK, thanks for the confirmation. As an AMD user, I am running tests on IC9sseW at home. I can then compare these to the tests I've already run on the official 1.1.3. It seems like the other versions are for Intel users to test.
Ingemar
QUOTE(gharris999 @ Jan 17 2007, 17:54) *

QUOTE(Ingemar @ Jan 17 2007, 09:22) *

Is it down or has it been moved?

Moved. Use http://www.hegardtfoundation.org/flacstuff....1.3-IC9sse.zip and http://www.hegardtfoundation.org/flacstuff...1.3-IC9sseW.zip instead.



Thank you :-)
gib
QUOTE(Martin H @ Jan 17 2007, 01:40) *

@gib

Could you please re-test...

The results do seem intuitively odd, but they are definitely correct. Nonetheless, I went ahead and retested after shutting down background programs as per your request. I didn't really have much to shut down, however, as I run a very lean system to begin with. Again, this is an Athlon64 3400+ running Windows 2000. Note: I tested with a different CD image this time, so the encoding times should not be directly compared to those in my previous post. Also, rather than using the test script from my previous test, I used timer.exe, as suggested by Egor in post #38.

Flac 1.1.3 original: 71.781s
Flac 1.1.3 IC9 sseW: 75.563s

Same type of results as before. HbG's results in post #39 show the same thing. For whatever reason, the ordinary flac 1.1.3 outperforms the optimized version on AMD processors.

Lastly, on a somewhat humorous side note, not only is the optimized build slower, but it produces larger files as well. For both of the tests I've run, the compressed CD image from the optimized build was about 80KB larger than the one from the ordinary compile. :-D
Synthetic Soul
Here are my results for my AMD Athlon XP 2400+ (CPU-Z report):

CODE
          |               Official                  |               IC9sseW
Setting   | Filesize       Comp %    Enc     Dec    | Filesize       Comp %    Enc     Dec
==========+=========================================+=====================================
0         | 1506304162    70.744%    95x    108x    | 1506304162    70.744%    90x    126x
5         | 1403710148    65.926%    40x     97x    | 1403845807    65.932%    37x    112x
6         | 1402766383    65.881%    36x     97x    | 1402903529    65.888%    37x    111x
7         | 1400824078    65.790%    12x     97x    | 1400902731    65.794%    12x    112x
8         | 1397210593    65.621%    10x     96x    | 1397336348    65.626%     9x    111x
8 -Ax2    | 1395443983    65.538%     6x     95x    | 1395572750    65.544%     5x    108x

As you can see, encoding speeds appear to be slightly impaired, but decoding speeds are reasonably improved. As noted previously, the IC9sseW compile produces slightly larger files, although curiously not with -0 (presumably this is because it does not use the same filters as the others).
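The size differences are easier to judge as byte counts. A short Python sketch over the file sizes from the table above (no other data is assumed):

```python
# (setting, official bytes, IC9sseW bytes), copied from the table above
sizes = [
    ("0",      1506304162, 1506304162),
    ("5",      1403710148, 1403845807),
    ("6",      1402766383, 1402903529),
    ("7",      1400824078, 1400902731),
    ("8",      1397210593, 1397336348),
    ("8 -Ax2", 1395443983, 1395572750),
]


def size_deltas(rows):
    """Return {setting: extra bytes written by the IC9sseW build}."""
    return {setting: ic9 - official for setting, official, ic9 in rows}


for setting, delta in size_deltas(sizes).items():
    print(f"-{setting}: {delta:+d} bytes")
```

Apart from -0, where the sizes match exactly, the IC9sseW files come out roughly 80 to 140 KB larger on about 1.4 GB of output, i.e. a difference on the order of 0.01%.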
dyneq
OK, I followed the instructions for running the testflac.cmd script. To recap, I am using:

Intel P4 530 3.0 GHz, 1GB RAM, XP, HT enabled. I used a 609 MB WAV, and the JPG is 25 KB.

The results:

CODE
Flac and MetaFlac 1.1.3 stock:     198.891
Flac and MetaFlac 1.1.3 IC9sse:     140.266
Flac and MetaFlac 1.1.3 IC9sseW:    108.359


All I can say is 'HOLY COW'! If you end up running Martin's QaxW optimization suggestion, by all means, post it here and I will re-run the test.

John
