Full Version: my speed test results of mingw compiles of lame
bluesky
I built various lame.exe files from the 3.96.1 source using MinGW-3.2.0-rc3, MSYS 1.0.10, and nasmw 98.39 for win32 according to the guide detailed here.

After building, I tested them encoding two different wav files (one about 19 meg and the other about 50 meg) using the --alt-preset standard switch. I also tested their ability to encode a 19 meg file to cbr. I ran each test 5 times for the vbr runs, and 3 times for the cbr run. The results are in the link below.

You'll see in the spreadsheet the run times (play/CPUx), a "% slower than icl4.5" column (that is, what percentage slower a build is compared to the icl4.5 build), the standard deviation of the run times, and the options I used to build them in the ./configure step.
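
As a rough illustration of how those spreadsheet columns could be derived, here is a minimal Python sketch. It assumes play/CPU is a speed ratio where higher is faster, and that "% slower" is the relative drop in mean play/CPU against the reference build; the exact formula used in the spreadsheet is not stated here, and the run values below are made up for the example.

```python
from statistics import mean, stdev

def percent_slower(build_runs, reference_runs):
    """Percentage by which a build's mean play/CPU trails the reference mean.

    Assumption: play/CPU is a speed ratio (higher is faster), so a positive
    result means the build is slower than the reference.
    """
    ref = mean(reference_runs)
    return (ref - mean(build_runs)) / ref * 100.0

def summarize(build_runs, reference_runs):
    """Return (mean, sample standard deviation, % slower) for one build."""
    return (mean(build_runs), stdev(build_runs),
            percent_slower(build_runs, reference_runs))

# Hypothetical play/CPU readings from five runs each (not data from the thread).
icl_runs = [10.0, 10.2, 9.8, 10.1, 9.9]
gcc_runs = [9.7, 9.8, 9.6, 9.7, 9.7]

avg, spread, slower = summarize(gcc_runs, icl_runs)
print(f"mean={avg:.2f} stdev={spread:.3f} slower={slower:.2f}%")
```

With only three to five runs per build, the sample standard deviation is a coarse measure, so differences well under 1% between builds should probably be read with some caution.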

Results: The rarewares.org icl4.5 build was found to be the quickest in all 3 tests, with mitiok's coming in at a very close second (<1% slower). My quickest build was 2-3% slower in the vbr tests, and really fell short in the cbr test.

Test machine: Athlon 3200+, 1.0 gig PC3200, ASUS A7N8X-E Deluxe mainboard, Hitachi 7200 SATA gen 1 drives, etc.
Test OS: Windows XP Pro SP2

If anyone has a suggestion for other switches to try given these results, I would be glad to compile it with them and run the tests.

View the test results here
DarkAvenger
Interesting, but obviously you don't know much about CFLAGS.

1) Check whether lame10 and lame11 are binary identical. As Lame is pure C code CXXFLAGS are pretty superfluous.

2) Flags like -m3dnow -msse etc are superfluous, as well, as they get enabled by using -march=athlon-xp

3) Please test my CFLAGS, as given in the other thread (maybe with and without -fprefetch-loop-arrays). In my tests on Linux I found -O2 to be faster than -O3, because -O2 doesn't enable automatic inlining, which bloats the binary. Furthermore, if your mingw is based on gcc 3.4.x I really don't advise using -O3, as it also enables -fweb, which seriously breaks stuff (xine-libs, kde, etc.) on athlon-xp, especially if asm is used. I haven't tried whether current gcc 3.4.3 fixes this.
bluesky
Thanks for the suggestions, I'll compile them now and run them through the same tests as the other builds.
dev0
I admire your ambition, but how about learning what some of those switches mean before mindlessly trying more semi-random combinations?

The recent mppenc fiasco should have taught us that aggressive compiler optimizations do have their downsides and should only be applied with care and knowledge.
bluesky
QUOTE(dev0 @ Feb 2 2005, 03:13 PM)
I admire your ambition, but how about learning what some of those switches mean before mindlessly trying more semi-random combinations?

The recent mppenc fiasco should have taught us that aggressive compiler optimizations do have their downsides and should only be applied with care and knowledge.
*



What switches would you recommend?
bluesky
@DarkAvenger

I tried your suggestions as well as several others. Tables have been updated if you care to see.

(app id = lame14) ./configure --enable-nasm CFLAGS="-march=athlon-xp -mtune=athlon-xp -O2 -pipe -frename-registers -fomit-frame-pointer -fprefetch-loop-arrays"
(app id = lame15) ./configure --enable-nasm CFLAGS="-march=athlon-xp -mtune=athlon-xp -O2 -pipe -frename-registers -fomit-frame-pointer"

You'll note that these two are slower than lame10 in both aps tests, but lame15 is now the fastest in the ap cbr 192 test.
bluesky
QUOTE(DarkAvenger @ Feb 2 2005, 05:37 AM)
Interesting, but obviously you don't know much about CFLAGS.

1) Check whether lame10 and lame11 are binary identical. As Lame is pure C code CXXFLAGS are pretty superfluous.
*



CODE
MD5 Sum of lame10.exe: 99 08 76 36 93 3A E9 47  DA 0E 80 DC 33 58 38 C8
MD5 Sum of lame11.exe: EC EF 2C 17 2D A1 C3 BD  0A A0 29 21 C3 01 4D E4
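
For anyone who wants to repeat this check without a separate checksum tool, a minimal Python sketch follows. It only uses the standard library `hashlib` module; the helper names `md5_of` and `binary_identical` are made up for this example.

```python
import hashlib

def md5_of(path, chunk_size=1 << 16):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def binary_identical(path_a, path_b):
    """True when both files hash to the same MD5 digest."""
    return md5_of(path_a) == md5_of(path_b)
```

Calling `binary_identical("lame10.exe", "lame11.exe")` would report whether the two builds match. The same helper works for comparing the MP3 files produced by different builds.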
DarkAvenger
QUOTE(bluesky @ Feb 2 2005, 10:01 PM)
CODE
MD5 Sum of lame10.exe: 99 08 76 36 93 3A E9 47  DA 0E 80 DC 33 58 38 C8
MD5 Sum of lame11.exe: EC EF 2C 17 2D A1 C3 BD  0A A0 29 21 C3 01 4D E4

*



Strange, anybody care to elaborate what goes on?

Could you also test:

CODE
CFLAGS="-march=athlon-xp -mtune=athlon-xp -O2 -pipe -frename-registers -fomit-frame-pointer -finline-functions"


I want to see whether the speed increase in the non-CBR tests comes from inlining, and if yes, I will try to analyse which functions get inlined so we could patch LAME so that those functions always get inlined...

Oh, btw what Athlon-XP core do you have? Maybe you have a Barton and thus 512KB L2 cache, which might explain why -O3 behaves better for you than for me. But I haven't thoroughly tested this with LAME...
bluesky
QUOTE(DarkAvenger @ Feb 3 2005, 01:26 AM)
Could you also test:

CODE
CFLAGS="-march=athlon-xp -mtune=athlon-xp -O2 -pipe -frename-registers -fomit-frame-pointer -finline-functions"


I'll do it later today and update the url when it's finished.

QUOTE
Oh, btw what Athlon-XP core do you have? Maybe you have a Barton and thus 512KB L2 cache, which might explain why -O3 behaves better for you than for me. But I haven't thoroughly tested this with LAME...
*



If I'm not mistaken, the 3200+ is a Barton. I'm pretty sure it does have 512k cache. I can also confirm this later today.
bluesky
Updated. That suggestion is app id = lame19
bluesky
According to Central Brain Identifier, I have a Thorton core with 512 k cache... then again, it also tells me I have 0 voltage!

DarkAvenger
Hmm, interesting. It seems inlining functions is really hurting, as my reasoning suggested.

So the only thing left is the "-fweb" flag.

QUOTE
-O3 Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -fweb and -frename-registers options.


So, if you want to do a last try, substitute the inline flag with the web flag. Theoretically this should be the fastest set of flags, but as I said, -fweb isn't safe. I'll test compiling xine-libs again and check whether it got fixed.

BTW, have you tested whether the output MP3s are binary identical?

Good news: It seems that -fweb has been fixed in current gcc. At least MPEG-2 playback doesn't seem to be corrupt in xine anymore. Need to test further.
bluesky
QUOTE(DarkAvenger @ Feb 3 2005, 04:57 PM)
Hmm, interesting. It seems inlining functions is really hurting, as my reasoning suggested.

So the only thing left is the "-fweb" flag.


Added the following two compiles:
CODE
lame20: --enable-nasm CFLAGS="-march=athlon-xp -mtune=athlon-xp -O2 -pipe -frename-registers -fomit-frame-pointer -fweb"
CODE
lame21:  --enable-nasm CFLAGS="-march=athlon-xp -mtune=athlon-xp -O3 -pipe -frename-registers -fomit-frame-pointer -fweb"


QUOTE
BTW, have you tested whether the output MP3s are binary identical?


I added a col. entitled, "sfv of mp3" so you can see that the checksums are the same for most, but not for all. BTW, your suggestion for lame20 is now the fastest in all 3 tests (0.06% faster in test 1, 0.48% in test 2, and 0.46% in test 3).

I still don't see why we can't get a compile on GCC that's optimized for an Athlon XP CPU that's faster than an ICL compile, since ICL is after all optimized for Intel chips!

..any other GCC junkies out there have other suggestions for switches?
robert
You can add "--enable-all-float" to your configuration.
DarkAvenger
QUOTE(bluesky @ Feb 4 2005, 01:16 AM)
Added the following two compiles:
CODE
lame20: --enable-nasm CFLAGS="-march=athlon-xp -mtune=athlon-xp -O2 -pipe -frename-registers -fomit-frame-pointer -fweb"
CODE
lame21:  --enable-nasm CFLAGS="-march=athlon-xp -mtune=athlon-xp -O3 -pipe -frename-registers -fomit-frame-pointer -fweb"


Oh well, haven't you read what I quoted? -O3 implies -fweb, so it shouldn't make a difference if -fweb is appended.
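
To make that point concrete, here is a small Python sketch that lists which flags in a CFLAGS string are already covered by -O3. The set of implied flags is taken only from the manual excerpt quoted in this thread, so it is an assumption that holds for that gcc generation and may not match other versions; the function name is invented for the example.

```python
# Flags that -O3 switches on in addition to -O2, taken from the manual
# excerpt quoted in this thread (gcc 3.4 era; other versions may differ).
IMPLIED_BY_O3 = {"-finline-functions", "-fweb", "-frename-registers"}

def redundant_flags(cflags):
    """List flags in a CFLAGS string that -O3 already implies."""
    flags = cflags.split()
    if "-O3" not in flags:
        return []
    return [flag for flag in flags if flag in IMPLIED_BY_O3]

lame20 = "-march=athlon-xp -mtune=athlon-xp -O2 -pipe -frename-registers -fomit-frame-pointer -fweb"
lame21 = "-march=athlon-xp -mtune=athlon-xp -O3 -pipe -frename-registers -fomit-frame-pointer -fweb"

print(redundant_flags(lame20))  # -O2 build: nothing is implied
print(redundant_flags(lame21))  # -O3 build: two flags are redundant
```

The sketch deliberately ignores negated forms such as a later -fno-web, which would need extra handling.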

QUOTE
I added a col. entitled, "sfv of mp3" so you can see that the checksums are the same for most, but not for all.


Ah, good to know. As you can see -ffast-math alters things.

QUOTE
BTW, your suggestion for lame20 is now the fastest in all 3 tests (0.06% faster in test 1, 0.48% in test 2, and 0.46% in test 3).


Ah, great, so my logic is still OK. I hope you learned a bit about how to find optimal flags. Instead of randomly choosing them, use a system...

BTW, these are now my new Gentoo flags, as well. :)

QUOTE
I still don't see why we can't get a compile on GCC that's optimized for an Athlon XP CPU that's faster than an ICL compile, since ICL is after all optimized for Intel chips!


Well, ICL is technologically much more advanced (but on the other hand broken in many respects). E.g., gcc 3.x has no vectorizer. So even if ICL doesn't do 3DNow!, it beats gcc. But it is nice to see that the gap is smaller than I thought.

The upcoming gcc4 will be technologically comparable to ICL, but it needs time to get rid of regressions.

QUOTE
..any other GCC junkies out there have other suggestions for switches?


Well, you could add the fast-math switch to my flags, but this will alter the file. I suppose this should give a small speed-up. But on the other hand, looking at your table, it doesn't.

There are not many flags left which don't alter quality. -ftracer is said to be beneficial, but in my tests I haven't found much difference, except that it broke things in other places.
Gabriel
Have you tried 2-pass compiling?
Some time ago it increased speed.
DarkAvenger
Are you referring to profiled compiling? How is this done?
Garf
I would suggest (using gcc 3.4.x):

1st pass:

-O3 -fprofile-generate -march=athlon-xp -mtune=athlon-xp

let LAME encode a few songs

2nd pass:

-O3 -fprofile-use -march=athlon-xp -mtune=athlon-xp

Test speed now

If the jury says it's ok to use -ffast-math with LAME then it can be added as well.

Test the above combination *first* before you try adding on other settings. Profile feedback optimization is generally smarter than you about when to inline and not. I think -fomit-frame-pointer is automatic with gcc 3.4, but I'm not sure so that's the first thing I'd try adding.

If you change switches you need to delete the *.gc* files that are generated.

Make sure the compiler is finding them when running the second pass.
mp4junkie
Has anyone compiled Lame with OpenWatcom?

When the compiler was commercial, it was used for Doom, Doom 2, as well as many other games, the Bink and Smacker video codec used in Warcraft, etc.... I always wanted to prove that OpenWatcom is very fast on Pentium computers.
It might even be updated for newer processors
Maybe even faster than Intel.

Has anyone else tried to?

Or maybe do some kind of proof of concept Intel vs Watcom showdown?


Google search for intel OR icl vs OR versus openwatcom OR watcom
http://www.google.com/search?num=30&hl=en&...com&btnG=Search
DarkAvenger
@Garf

-fomit-frame-pointer is not the default on x86. It depends on whether the platform supports debugging symbols (or the like) with it.

So I did a very quick test on Linux and yes, here profiled -march=athlon-xp -mtune=athlon-xp -O3 -pipe -fomit-frame-pointer is better than the above "optimal flags" with profiling.

Optimal flags: 8.11 play/CPU
profiled '': 8.31 play/CPU
profiled "-O3": 8.37 play/CPU

I wonder whether the ICL compiles by mitiok and john33 are profiled?
john33
QUOTE(DarkAvenger @ Feb 4 2005, 04:35 PM)
I wonder whether the ICL compiles by mitiok and john33 are profiled?
*


Speaking for myself, no, they're not. I suppose it would be worth doing on the final releases, but the alphas/betas tend to change too frequently to make it worth the effort. I remember when I profiled a version of LAME some time back, the difference in performance was sufficiently marginal as to hardly make it worthwhile.
damaki
-fomit-frame-pointer is implied by -O2 and -O3; I saw that in the gcc manual two days ago.
DarkAvenger
I know what is written in the manual, nevertheless it is not right.
bluesky
QUOTE(Garf @ Feb 4 2005, 09:59 AM)
I would suggest (using gcc 3.4.x):

1st pass:

-O3 -fprofile-generate -march=athlon-xp -mtune=athlon-xp

let LAME encode a few songs

2nd pass:

-O3 -fprofile-use -march=athlon-xp -mtune=athlon-xp

Test speed now
*



Yeah it's gcc 3.4.2.

I did it 2-pass with your suggestions: first pass uses: ./configure --enable-nasm CFLAGS="-O3 -fprofile-generate -march=athlon-xp -mtune=athlon-xp"

(did all 3 tests)

2nd pass uses: ./configure --enable-nasm FLAGS="-O3 -fprofile-use -march=athlon-xp -mtune=athlon-xp"

I updated the data sheet now -- this one came in dead last for all 3 tests (it's lame22 btw).

I'm doing it now with the -FFP switch added (that'll be lame23 in the datasheet, which should be updated in about 10-20 mins).
DarkAvenger
Did you do a make clean in between? It seems you benched the profiling binary not the profiled one...

Oh, I hope you used CFLAGS and not FLAGS in profile-use...
bluesky
QUOTE(DarkAvenger @ Feb 5 2005, 06:15 AM)
Did you do a make clean in between? It seems you benched the profiling binary not the profiled one...
*



I'm pretty sure I didn't bench the first pass bin because its date/time stamp differed from the 2nd pass encode.

I didn't do a make clean though... so is this the procedure:

1. compile pass 1
2. test binary
3. make clean
4. ./configure switching to -fprofile-use
5. make; make install

???

Also, by using that binary before issuing the same set of flags (with the exception of -fprofile-use), how does the compiler learn anything? It doesn't write a log or the like.
Gabriel
QUOTE
Also, by using that binary before issuing the same set of flags (with the exception of -fprofile-use), how does the compiler learn anything? It doesn't write a log or the like.

The compiler should log branch decisions into a file, and use this data as input for the second pass.
That is why you have to run a few encodes with Lame after first compile so usage statistics can be gathered.

There is probably something not properly working in your 2 pass compilation, as it should NOT reduce speed.
bluesky
I think my problem was that I didn't do a "make clean" after testing and prior to doing the 2nd pass.

I just did the 2nd pass compile of a new build that is FASTER (by nearly 1%) than the rarewares.org ICL build in the 19 meg file aps test!!

Here are the data for the new build:

54 meg file (-0.6% slower)
19 meg file (0.99% FASTER)
It's still slower in the cbr test, but is the fastest one yet.

View the test results here for yourself

Comments are welcome :)
dev0
More profiling could make it even faster.
bluesky
QUOTE(dev0 @ Feb 5 2005, 07:13 AM)
More profiling could make it even faster.
*



I thought I was profiling when doing this 2-pass procedure? Am I not understanding something?
dev0
This is what I meant when I said:
QUOTE
I admire your ambition, but how about learning what some of those switches mean before mindlessly trying more semi-random combinations?


You don't even understand what you are doing.
AstralStorm
In about a week's time...
I'll jump in with tests of GCC 4 vs GCC 3.4.3 (both CVS snapshots) on Linux.
Safe flags first, then something more crazy - all with MD5sums of the files.

Fortunately, I don't own any M$ soft and thus can't compare it to any Windows build. :)

Some info right now:
LAME compiles with both compilers and produces audibly correct files. (unABXable for me)

CFLAGS="-march=athlon-xp -O2 -pipe"
Nasm available and working.

CPU info:
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 10
model name : AMD Athlon™ XP 2400+
stepping : 0
cpu MHz : 2004.552
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse pni syscall mmxext 3dnowext 3dnow
bogomips : 3973.12
bluesky
QUOTE(dev0 @ Feb 5 2005, 07:21 AM)
This is what I meant when I said:
QUOTE
I admire your ambition, but how about learning what some of those switches mean before mindlessly trying more semi-random combinations?


You don't even understand what you are doing.
*



Of course I don't. I'm not a dev or even a programmer... my experience compiling binaries is very narrow. In the past, I was forced to d/l source (bins/debs were out there) for some great apps under LINUX and do a simple "./configure; make; make install" to get a binary.

After taking some great suggestions here from many people, a lame.exe has been created that is faster than the gold standard compile in speed on an athlon xp CPU in some cases. That's no small feat to me.

If you want to help me understand what I'm doing that would be great.
bluesky
QUOTE(AstralStorm @ Feb 5 2005, 07:35 AM)
In about a week's time...
I'll jump in with tests of GCC 4 vs GCC 3.4.3 (both CVS snapshots) on Linux.
Safe flags first, then something more crazy - all with MD5sums of the files.

Fortunately, I don't own any M$ soft and thus can't compare it to any Windows build. :)
*



That would be very cool. Perhaps you could take one or two of the switches I found to give the fastest results and include them in your tests? If you want, I'll make the wav files I used available to you so we can do an apples-to-apples comparison.... although that would raise some copyright issues... hmmm.
Garf
QUOTE(bluesky @ Feb 5 2005, 02:42 PM)
If you want to help me understand what I'm doing that would be great.
*



This website can help you:

http://www.google.com
AstralStorm
QUOTE(bluesky @ Feb 5 2005, 02:45 PM)
If you want, I'll make the wav files I used available to you so we can do an apples-to-apples comparison.... although that would raise some copyright issues... hmmm.
*


I'm currently using 3 classic samples, namely fatboy, castanets and bachpsichord...
Please check problem samples archive.
Bachpsichord was featured in one of RJAmorim's tests.
DarkAvenger
@bluesky

Regarding your webpage: Have you previewed it in firefox? It looks very fsked up here (sometimes more, sometimes less). Please use proper HTML...

BTW:

QUOTE
1. compile pass 1
2. test binary
3. make clean
4. ./configure switching to -fprofile-use
5. make; make install


ad 2: This is not about "testing"; it is about profiling. Feed in as much data as possible and use as many modes as possible so the binary gathers profiling information. In the 2nd pass this info is then used to optimize the binary.
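
The two-pass procedure quoted above can be sketched as an ordered list of shell commands. The Python below only builds and prints that list; it does not run anything. The helper name, the sample file names and the `frontend/lame` path are placeholders for illustration, and the sketch assumes, as the results earlier in the thread suggest, that the profile data written by the pass-1 binary is still available to the compiler after `make clean`.

```python
def two_pass_plan(base_cflags, training_cmds):
    """Return the shell commands for a profile-guided build, in order.

    Nothing is executed; the list is meant to be reviewed or pasted.
    """
    def configure(profile_flag):
        return f'./configure --enable-nasm CFLAGS="{base_cflags} {profile_flag}"'

    plan = [configure("-fprofile-generate"), "make"]
    plan += list(training_cmds)      # run the instrumented binary on real input
    plan += ["make clean",           # drop the pass-1 objects
             configure("-fprofile-use"),
             "make"]
    return plan

# Hypothetical training encodes; file names and the binary path are placeholders.
training = [
    "frontend/lame --alt-preset standard sample1.wav",
    "frontend/lame --alt-preset cbr 192 sample2.wav",
]

for step in two_pass_plan("-O3 -march=athlon-xp -mtune=athlon-xp", training):
    print(step)
```

Adding more training commands that cover the modes you actually use is the part that matters; the build steps themselves stay the same.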
AstralStorm
QUOTE(DarkAvenger @ Feb 5 2005, 04:10 PM)
Regarding your webpage: Have you previewed it in firefox? It looks very fsked up here (sometimes more, sometimes less). Please use proper HTML...
*


Hey, it's all right here...
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0
bluesky
QUOTE(DarkAvenger @ Feb 5 2005, 09:10 AM)
@bluesky

Regarding your webpage: Have you previewed it in firefox? It looks very fsked up here (sometimes more, sometimes less). Please use proper HTML...
*




Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041217

Looks fine in my end. Also, it should be proper HTML... I just exported my excel tables as a webpage.
krmathis
bluesky, great work! :D
Even if some people around here don't see the real purpose of these speed tests.

Your result page looks fine here, using my own Firefox build:
Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8b) Gecko/20050204 Firefox/1.0+ (PowerBook)
DarkAvenger
Well, now it looks 99% OK, again...just some lines overwriting each other..

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20050121 Firefox/1.0


BTW, have you tried robert's suggestion and passed --enable-all-float to configure? It indeed gave me about 2-3% speed-up, but I guess the mp3 files won't be identical anymore.

Another interesting finding: With --enable-all-float, fast-math improves speed even more, but without --enable-all-float, fast-math has a negative effect, just as can be seen in bluesky's results.
bluesky
@DA

I'll try those switches a bit later and add the respective lines to the data tables.
bluesky
Okay gang,

Completed robert's suggestion of adding --enable-all-float and DA's suggestion to use -ffast-math. The results are indeed a speedup.

Here are the data for the new build (lame25):

54 meg file (+2.09% FASTER than icl build)
19 meg file (+3.09% FASTER than the icl build)
It's still slower in the cbr test, but is the fastest one yet.

...I'll make it the exact same way except this time I will profile the hell out of it (many wav files, many different switches, aps ape, apcbr 192, 224, 320, 128 etc.) and see if it makes a difference.

EDIT:

Well, it did make a difference, if you compare lame25 to 26 (26 being the one with more extensive profiling):

54 meg file (+0.05% FASTER than lame25)
19 meg file (+0.84% FASTER than lame25)
19 meg cbr (+0.94% FASTER than lame25)

The other thing I noticed was that the standard deviations for lame26 are a bit higher (less precision?)... also, the first time it did the encode in all cases, it returned the slowest of the runs. I dunno what to make of that.
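
One way to check whether the slow first run accounts for the larger standard deviation is to compare the statistics with and without it. The Python sketch below does that on made-up play/CPU readings (higher is faster); the function name is invented for the example. A cold disk cache on the first read of the wav file is one possible cause, but nothing here confirms it.

```python
from statistics import mean, stdev

def first_run_effect(runs):
    """Compare the first run with the remaining runs.

    Returns (gap, stdev_all, stdev_rest), where gap is the mean of the later
    runs minus the first run; a positive gap means the first run scored lower.
    Needs at least three runs so the remainder still has a standard deviation.
    """
    first, rest = runs[0], runs[1:]
    return mean(rest) - first, stdev(runs), stdev(rest)

# Hypothetical play/CPU readings (higher is faster), not data from the thread.
runs = [9.0, 10.0, 10.2, 9.8, 10.0]
gap, sd_all, sd_rest = first_run_effect(runs)
print(f"gap={gap:.2f} stdev_all={sd_all:.3f} stdev_rest={sd_rest:.3f}")
```

If the spread drops sharply once the first run is excluded, a throwaway warm-up encode before the timed runs might give tighter numbers.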