Full Version: Ogg Vorbis optimized for speed
nyaochi
Some Japanese developers are working on speed optimization of libvorbis using SSE. Blacksword (or 637) launched an Ogg Vorbis acceleration project (in Japanese only) and releases an oggenc binary and a libvorbis patch based on libvorbis 1.1. The optimization includes SSE implementations of FFT, MDCT, windowing, channel coupling, sorting, the psymodel, floor/residue encoding, and so on. On my computer (Pentium 4, 2.4 GHz), the ICL 8.1-compiled oggenc binary of the optimized version (Archer Beta03) encodes at 23.4x, while the one without the optimization (ICL 8.1-compiled but no SSE patches) encodes at 15.5x. Hence, this optimization achieves a speed gain of about 1.5x.

Unlike GoGo-no-coder, it's not a fork: he releases a patch for the libvorbis source code that changes neither the algorithms nor the data structures at all. This is very good for source code maintenance, since it makes it easy to keep up with the official libvorbis, but it limits the optimization possibilities to some degree. In fact, the author says in readme.txt that there is little room left for optimization. So I think it's time for quality evaluation, although this optimization is still in the development stage. After several bugs were found and fixed over the last week, bitrates are quite similar to the reference encoder for all quality values. If you find any bugs or quality regressions from the official 1.1 release, please tell us.

Contributors are:
- Blacksword (or 637)'s SSE optimization (Japanese only): A number of functions in libvorbis are vectorized to take advantage of the SSE instruction set, and the project also draws on OptSort and wuvorbis. For the complete list of optimized functions, see readme.txt (in Japanese, but you may easily find it) attached with the binary.
- Manuke's OptSort: An optimization of the qsort function, which consumes 20% of the compression processing time. It assumes that the _vp_quantize_couple_sort and _vp_noise_normalize_sort functions in psy.c call qsort with 8 or 32 elements. This accelerates the whole compression process by 10%.
- W.Dee's wuvorbisfile (Japanese only?): wuvorbis.dll is a fast Ogg Vorbis decoder with SSE and 3DNow! support, which is part of the KiriKiri software (useful for developing multimedia content or adventure games). wuvorbis.dll decodes 1.4x-1.8x faster (SSE) and 1.5x-1.9x faster (3DNow!) than the official libvorbis.
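
To illustrate the idea behind a fixed-size sort specialization (this is not Manuke's code, which is not shown in this thread): when the caller only ever sorts 8 or 32 elements, a simple insertion sort can stand in for the generic qsort call and its comparison-callback overhead. The sketch below is in Python for readability only, so it shows the structure rather than the speed gain. The function name and the descending-by-magnitude ordering are assumptions made for the example.

```python
def sort_indices_by_magnitude(values):
    """Return the indices of `values`, ordered by descending absolute value.

    Hypothetical helper: the sizes 8 and 32 get a dedicated insertion sort,
    every other size falls back to the generic library sort.
    """
    n = len(values)
    if n not in (8, 32):
        # Generic path, comparable to calling qsort with a comparator.
        return sorted(range(n), key=lambda i: -abs(values[i]))

    idx = list(range(n))
    for i in range(1, n):
        cur = idx[i]
        j = i
        # Shift smaller magnitudes to the right until `cur` fits.
        while j > 0 and abs(values[idx[j - 1]]) < abs(values[cur]):
            idx[j] = idx[j - 1]
            j -= 1
        idx[j] = cur
    return idx


sample = [0.5, -3.0, 2.0, -0.25, 4.0, 1.0, -1.5, 0.0]
print(sort_indices_by_magnitude(sample))
```

In C the benefit would come from the known element count (loops can be unrolled, no function-pointer comparator), which Python cannot demonstrate.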

Happy encoding!
dev0
fefe was working on an (apparently buggy) SSE optimization of libvorbis too.
Do the optimizations only affect encoding, or decoding as well?
ilikedirtthe2nd
I achieved an almost 100% (actually more like 85%) speed increase (against ICL 8.1 on an AMD Athlon XP 1800+)

ICL 8.1: 9,8x
Optimized: 18,0x

Pretty good!
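
For reference, the percentage can be worked out from the two rates. This is just the arithmetic, with a made-up function name:

```python
def speed_increase_percent(baseline_rate, optimized_rate):
    """Return the speed increase of the optimized build over the baseline, in percent."""
    return (optimized_rate / baseline_rate - 1.0) * 100.0


# Rates reported above: 9.8x for the plain ICL 8.1 build, 18.0x for the optimized one.
print(round(speed_increase_percent(9.8, 18.0), 1))  # 83.7
```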
TedFromAccounting
Wow! Now that is FAST. My results were similar to ilikedirtthe2nd's (actually a little better).
nyaochi
QUOTE(dev0 @ Nov 5 2004, 04:37 AM)
fefe was working on an (apparently buggy) SSE optimization of libvorbis too.
Do the optimizations only affect encoding, or decoding as well?
*

Oh, I didn't know about fefe's optimization. I'll check whether it can benefit Blacksword's optimization.

IMHO this optimization affects both the encoding and decoding sides, although an optimized oggdec has not been tested or released. Several functions for decoding (e.g., vorbis_synthesis_blockin, mapping0_inverse, mdct_backward, etc.) are optimized too.
QuantumKnot
Whoa, it's really fast!

On my P4 2.4 GHz:

ICL compiled oggenc from rarewares: 13.2x
SSE optimised oggenc: 20.5x
Bonzi
Pretty nice speedup here too:
oggenc from rarewares 10.4x
SSE optimized 15.3x
Music Mixer
Hello!

Well, I have got an older machine (P3 700) and received a speedup from 4.4x to 9.3x realtime.

Have you guys tested the SSE2 optimized build at http://homepage3.nifty.com/blacksword/ ?

I wonder how big the speedup with this build is for P4 and AMD64 CPUs.
Sebastian Mares
According to my tests...

ICL 8.1 Standard:

CODE

File length:  4m 58,0s
Elapsed time: 0m 18,0s
Rate:         16,5778
Average bitrate: 236,7 kb/s


ICL 8.1 Pentium 4:

CODE

File length:  4m 58,0s
Elapsed time: 0m 17,0s
Rate:         17,5529
Average bitrate: 236,7 kb/s


SSE:

CODE

File length:  4m 58,0s
Elapsed time: 0m 18,0s
Rate:         16,5778
Average bitrate: 236,7 kb/s


SSE2:

CODE

File length:  4m 58,0s
Elapsed time: 0m 18,0s
Rate:         16,5778
Average bitrate: 236,7 kb/s


Tested with "Toto - Africa" on a Pentium 4 with 3.2 GHz, 512 MB RAM, running Windows XP Professional Service Pack 1.
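
The Rate line in these reports is the audio length divided by the elapsed encoding time. The figures above are consistent with an audio length of about 298.4 s and elapsed times counted in whole seconds, so a one-second timing difference would already show up as a jump of roughly 1x. A minimal sketch, where the 298.4 s length is inferred from the reported numbers and not stated in the post:

```python
def encoding_rate(audio_seconds, elapsed_seconds):
    """Return how many times faster than real time the encode ran."""
    return audio_seconds / elapsed_seconds


# Assumption: inferred from the reported rate 16.5778 multiplied by 18 s.
AUDIO_SECONDS = 298.4

for elapsed in (17, 18):
    print(elapsed, round(encoding_rate(AUDIO_SECONDS, elapsed), 4))
```

With timing this coarse, builds that differ by less than a second on a five-minute track would report identical rates, so a longer input or several runs may be needed to separate them.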
esa372
I got a good increase, too...

SSE2
CODE
       File length:  5m 23.0s
       Elapsed time: 0m 12.0s
       Rate:         26.9556
       Average bitrate: 175.3 kb/s

ICL 8.1
CODE
       File length:  5m 23.0s
       Elapsed time: 0m 19.0s
       Rate:         17.0246
       Average bitrate: 175.3 kb/s



But I can't seem to get it to work on FLAC files...
CODE
ERROR: Input file "01.flac" is not a supported format.

Am I missing something??

Thanks,

~esa

:edit: typo
ilikedirtthe2nd
QUOTE(esa372 @ Nov 6 2004, 03:10 PM)
But I can't seem to get it to work on FLAC files...
CODE
ERROR: Input file "01.flac" is not a supported format.

Am I missing something??


Standard oggenc doesn't input lossless files directly. Only Oggenc2.3 from rarewares does.

Regards; ilikedirt
dev0
QUOTE(ilikedirtthe2nd @ Nov 6 2004, 04:24 PM)
QUOTE(esa372 @ Nov 6 2004, 03:10 PM)
But I can't seem to get it to work on FLAC files...
CODE
ERROR: Input file "01.flac" is not a supported format.

Am I missing something??


Standard oggenc doesn't input lossless files directly. Only Oggenc2.3 from rarewares does.

Regards; ilikedirt
*


The standard oggenc supports FLAC input perfectly. It's a compile-time option AFAIK.
john33
QUOTE(dev0 @ Nov 6 2004, 03:46 PM)
The standard oggenc supports FLAC input perfectly. It's a compile-time option AFAIK.
*

It sure is.
esa372
QUOTE(ilikedirtthe2nd @ Nov 6 2004, 08:24 AM)
Standard oggenc doesn't input lossless files directly.

QUOTE(dev0 @ Nov 6 2004, 08:46 AM)
The standard oggenc supports FLAC input perfectly.



Well, I can't say that the issue is any clearer for me now...
ilikedirtthe2nd
QUOTE
It's a compile-time option AFAIK.


That means oggenc is able to read FLAC if this is enabled when compiling. So in general it can read FLAC, but this particular build cannot.
esa372
QUOTE(ilikedirtthe2nd @ Nov 6 2004, 09:19 AM)
...oggenc is able to input flac, if this is enabled when compiling. So: generally it is able to read flac, but this version is not.
Ah... thank you for the clarification!


~esa
nyaochi
QUOTE(Music Mixer @ Nov 6 2004, 04:10 PM)
Have you guys tested the SSE2 optimized build at http://homepage3.nifty.com/blacksword/
?

I wonder how big the speedup with this build is for p 4 and amd 64 cpus.
*

I could not find any speed difference between the SSE and SSE2 versions on my Pentium 4 machine. Has anybody seen a speed increase? The author wants to know the effect in order to decide whether he should continue the SSE2 version or not.

QUOTE(Sebastian Mares @ Nov 6 2004, 06:18 PM)
According to my tests...

ICL 8.1 Standard:

CODE

File length:  4m 58,0s
Elapsed time: 0m 18,0s
Rate:         16,5778
Average bitrate: 236,7 kb/s


SSE:

CODE

File length:  4m 58,0s
Elapsed time: 0m 18,0s
Rate:         16,5778
Average bitrate: 236,7 kb/s

*

Are the SSE and SSE2 binaries your own builds? If so, don't forget to define the symbol __SSE__ when compiling to activate the optimization.
Sebastian Mares
QUOTE(esa372 @ Nov 6 2004, 04:10 PM)
I got a good increase, too...

ILC 8.1
CODE
       File length:  5m 23.0s
       Elapsed time: 0m 12.0s
       Rate:         26.9556
       Average bitrate: 175.3 kb/s

SSE2
CODE
       File length:  5m 23.0s
       Elapsed time: 0m 19.0s
       Rate:         17.0246
       Average bitrate: 175.3 kb/s



But I can't seem to get it to work on FLAC files...
CODE
ERROR: Input file "01.flac" is not a supported format.

Am I missing something??

Thanks,

~esa
*


Huh? The ICL 8.1 compile is faster.

QUOTE(nyaochi @ Nov 6 2004, 06:20 PM)
QUOTE(Music Mixer @ Nov 6 2004, 04:10 PM)
Have you guys tested the SSE2 optimized build at http://homepage3.nifty.com/blacksword/
?

I wonder how big the speedup with this build is for p 4 and amd 64 cpus.
*

I could not find speed difference between SSE and SSE2 versions on my Pentium IV machine. Is there anybody who gets speed increase? The author wants to know the effect to determine whether if he should continue SSE2 version or not.

QUOTE(Sebastian Mares @ Nov 6 2004, 06:18 PM)
According to my tests...

ICL 8.1 Standard:

CODE

File length:  4m 58,0s
Elapsed time: 0m 18,0s
Rate:         16,5778
Average bitrate: 236,7 kb/s


SSE:

CODE

File length:  4m 58,0s
Elapsed time: 0m 18,0s
Rate:         16,5778
Average bitrate: 236,7 kb/s

*

Are SSE and SSE2 binaries your own builds? If so, don't forget to define a symbol __SSE__ to activate the optimization when compiling.
*


Nope, they're not my own compiles.
esa372
QUOTE(Sebastian Mares @ Nov 6 2004, 11:05 AM)
Huh? The ICL 8.1 compile is faster. blink.gif
Whoops! No, that's a typo... I'll edit immediately...
kjoonlee
OK, here are some partial translations:

OggEnc_SSE_20041101ArcherB03.zip
Changes regarding/surrounding comments
Improved low-bitrate quality

Current problems are:
  • When encoding at low bitrates, treble quality suffers, and size bloat occurs.
  • Could hang immediately on running, depending on the environment
  • Bugs due to changes to comment handling?
nyaochi
QUOTE(kjoonlee @ Nov 7 2004, 03:17 AM)
OK, here are some partial translations:

OggEnc_SSE_20041101ArcherB03.zip
Changes regarding/surrounding comments
Improved low-bitrate quality

Current problems are:


  • When encoding at low bitrates, treble quality suffers, and size bloat occurs.

  • Could hang immediately on running, depending on the environment

  • Bugs due to changes to comment handling? unsure.gif


*

Thanks for the translation. I think all of the current problems listed above are solved in Archer B03. These problems existed in Archer B02.
QuantumKnot
IIRC, SSE2 mainly adds double-precision floating point, so maybe there isn't much difference from SSE, since libvorbis doesn't use many doubles?
Benjamin Lebsanft
Tested on my AMD64 3400+, 1GB RAM

ICL 8.1:

File length: 4m 27.0s
Elapsed time: 0m 14.0s
Rate: 19.1190
Average bitrate: 132.9 kb/s

ICL 8.1 (John33):

File length: 4m 27.0s
Elapsed time: 0m 11.0s
Rate: 24.3333
Average bitrate: 132.9 kb/s

SSE/SSE2 Optimized:

File length: 4m 27.0s
Elapsed time: 0m 08.0s
Rate: 33.4583
Average bitrate: 132.9 kb/s

SSE2 optimization doesn't change encoding speed
john33
As QK says, there's very little use of double precision in libvorbis, so the use of SSE2 optimisation is virtually a waste of effort.
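
As a rough illustration of this point: an SSE register is 128 bits wide, so it holds four single-precision floats or two doubles. Packed single-precision arithmetic is already covered by SSE; the packed double-precision form came with SSE2, which only helps code that actually works in doubles. The helper name below is made up for the example.

```python
import struct

XMM_BITS = 128  # width of one SSE register


def lanes_per_register(fmt):
    """Return how many values of struct format `fmt` fit into one 128-bit register."""
    return XMM_BITS // (8 * struct.calcsize(fmt))


print("float lanes: ", lanes_per_register("<f"))   # single precision
print("double lanes:", lanes_per_register("<d"))   # double precision
```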
nyaochi
QUOTE(QuantumKnot @ Nov 7 2004, 10:55 AM)
IIRC, SSE2 is optimised for double point precision so maybe there isn't that much difference with SSE since libvorbis doesn't use many of them? unsure.gif
*

QUOTE(john33 @ Nov 7 2004, 06:41 PM)
As QK says, there's very little use of double precision in libvorbis, so the use of SSE2 optimisation is virtually a waste of effort.
*

Actually, he expects higher quality (or speed) in float-to-integer and integer-to-float conversion but, at the same time, doubts the effect. I'll tell him these results.
Sebastian Mares
QUOTE(john33 @ Nov 7 2004, 10:41 AM)
As QK says, there's very little use of double precision in libvorbis, so the use of SSE2 optimisation is virtually a waste of effort.
*


That explains why my SSE and SSE2 tests achieve the same result.
Poromenos
OK, for the newb with no ability for critical thinking (me), would you recommend switching to this version from "OggEnc v2.3 (libvorbis 1.1.0)"? I'd like to have the extra speed, but if it introduces bugs I can wait.
I thought of encoding with both and then comparing the files, but the sizes differed by a few bytes and the contents were not identical (there were 80ish different bytes every Y bytes). What's that about?
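
A byte-level comparison like this can be sketched as follows. The function name is made up for the example, and the file names in the usage note are placeholders. Note that a byte difference alone does not say anything about audible quality: vectorized float code may round slightly differently from the scalar code, so small differences in the output are plausible even without a bug.

```python
def count_differing_bytes(a, b):
    """Compare two byte strings position by position.

    Returns (differing_bytes_in_common_prefix, length_difference).
    """
    common = min(len(a), len(b))
    differing = sum(1 for x, y in zip(a[:common], b[:common]) if x != y)
    return differing, abs(len(a) - len(b))


# Small in-memory demonstration.
print(count_differing_bytes(b"OggS\x00\x02abc", b"OggS\x00\x02abd!"))
```

For real files, read them first, e.g. `pathlib.Path("reference.ogg").read_bytes()` and `pathlib.Path("sse.ogg").read_bytes()`. Once the sizes differ, everything after the first inserted byte is shifted, so a positional comparison overstates the difference.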
QuantumKnot
QUOTE(Poromenos @ Nov 8 2004, 08:31 PM)
OK, for the newb with no ability for critical thinking (me tongue.gif), would you recommend switching to this version from "OggEnc v2.3 (libvorbis 1.1.0)"? I'd like to have the extra speed, but if it introduces bugs I can wait smile.gif
I thought of encoding with both then comparing the files, but the size was a few bytes different and they were not identical (there were 80ish different bytes every Y bytes). What's that about?
*


IMO, it's best to stick to the normal compile of oggenc. More testing is required.
Sebastian Mares
I see no speed gain when compared to the Pentium 4 builds from RareWares.
JensRex
I'd be more interested in decoder speedups - especially for portable devices. Vorbis playback in my Tungsten T3 eats battery like crazy.
Gecko
Here's a late reply. I tested on two titles and the results look great. The sse2 version offers zero speed increase; the numbers are exactly the same. System: Athlon 64 3000. Turns out I've been previously using Ogg Vorbis 1.1 rc1 from rarewares. Oh well. Quality level is 5.

Die fantastischen Vier - Mein Schwert [hip hop-ish, CD rip]
1.1rc1 - 14,9936
sse/sse2 - 22,9893

G&M Project - Sunday Afternoon (Nu Nrg Mix) [trance, wav previously decoded from mpc q7]
1.1rc1 - 15,7454
sse/sse2 - 27,9919

I was evaluating whether I should use ogg or mp3 on my soon-to-be-shipped iRiver, so I will do a lot of transcoding. I don't know if the fact that I am using an already lossy source accounts for the speed increase.

These speeds even surpass mpc encoding (usually 22-23x)! Lame 3.96.1 clocks in at about 8x for aps and 17x for apfs.

BTW: version "Archer B04" is out, which is claimed to be even a bit faster.
edit2: well, not for me. Speeds are identical to B03.
[solid]
how should i apply the patch? i get all hunks failed...
using linux, official libvorbis-1.1.0 and the same happens for both B03 and B04
ak
I remember trying to apply it; there were a bunch of whitespace diffs, so try 'patch -l ...'

Oops, actually, that was the case with current svn.

For 1.1.0, running dos2unix on the patch should do.
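
If dos2unix is not at hand, the same line-ending conversion (CRLF to LF) can be sketched in a few lines. The function name is made up, and the patch file name in the usage note is a placeholder:

```python
def crlf_to_lf(data):
    """Convert DOS line endings (CRLF) to Unix line endings (LF) in a byte string."""
    return data.replace(b"\r\n", b"\n")


sample = b"--- a/lib/psy.c\r\n+++ b/lib/psy.c\r\n"
print(crlf_to_lf(sample))
```

To convert a file in place, read it with `pathlib.Path("libvorbis.patch").read_bytes()`, pass the result through `crlf_to_lf`, and write it back with `write_bytes`.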
[solid]
QUOTE(ak @ Nov 12 2004, 10:36 AM)
For 1.1.0 running dos2unix on patch should do.
*

oh crap it was that simple... haven't thought of that, thanks. compiling right now
nyaochi
QUOTE(Sebastian Mares @ Nov 8 2004, 09:26 PM)
I see no speed gain when compared to the Pentium 4 builds from RareWares.
*

Weird...

QUOTE(Gecko @ Nov 12 2004, 06:49 AM)
BTW: version "Archer B04" is out, which is claimed to be even a bit faster.
edit2: well, not for me. Speeds are identical to B03.
*

I got a slight speed increase (23.73x) over B03 (23.37x).
Benjamin Lebsanft
On the first run I got 38.2381x, on the second run 33.4583x, which is the same as B03.
jg123
It looks like the resample option is broken? I get a crash using the resample option on B04. I'm trying to resample a 16 kHz stereo wav file to a -q0 44100 ogg.
kuniklo
Does anyone have the sse optimizations in the form of a patch to 1.1?

I'd like to try building a linux binary of this.
Bogalvator
The patch is the first file on the project web page:
http://homepage3.nifty.com/blacksword/


This is great stuff by the way, I hope that development / testing continues.
maacruz
QUOTE(Bogalvator @ Nov 15 2004, 08:02 PM)
The patch is the first file on the project web page:
http://homepage3.nifty.com/blacksword/


This is great stuff by the way, I hope that development / testing continues.
*


I have tried it and it doesn't work for me.
It does compile after some editing, but both encoding and playback are badly broken.
nyaochi
QUOTE(maacruz @ Nov 17 2004, 02:21 AM)
It does compile after some editing, but both enconding and playback are badly broken.
*

Could you give a better description of "badly broken"? Actually I didn't compile B04, but my own compile (ICL 8.1) of B03 worked fine.
nyaochi
QUOTE(jg123 @ Nov 16 2004, 01:53 AM)
It looks like the resample option is broken? I get a crash using the resample option on B04. I'm trying to resample a 16 kHZ stereo wav file to a -q0 44100 ogg.
*

Archer Beta05 has been released, mainly to solve this problem.
- Uses libogg 1.1.2 (up from 1.1.1)
- Fixed a crash (16-byte alignment exception) in the resample/downmix routines in audio.c (for oggenc and oggdropXPd)
- Updated the build script for automake/autoconf
- Activated FLAC reading support in oggenc, using FLAC 1.1.1 (ICL compile)
QUOTE(nyaochi @ Nov 17 2004, 05:18 PM)
QUOTE(maacruz @ Nov 17 2004, 02:21 AM)
It does compile after some editing, but both enconding and playback are badly broken.
*

Could you give a better description of "badly broken"? Actually I didn't complie B04, but my own compile (ICL8.1) of B03 worked fine.
*

One thing I forgot to mention: it is strongly recommended to use gcc 3.3. The patch does not work with gcc 3.4 and other versions.
Benjamin Lebsanft
Could anybody please provide a Linux binary? As my box is using gcc 3.4.3, I am not able to compile it on my own.
Thanks
maacruz
QUOTE(nyaochi @ Nov 17 2004, 05:30 PM)
QUOTE(jg123 @ Nov 16 2004, 01:53 AM)
It looks like the resample option is broken? I get a crash using the resample option on B04. I'm trying to resample a 16 kHZ stereo wav file to a -q0 44100 ogg.
*

Archer Beta05 is released mainly to solve this problem.
- Use of libogg 1.1.2 (version up from 1.1.1)
- Fixed a crash (16 byte-alignement exception) of resample/downmix routines in audio.c (for oggenc and oggdropXPd)
- Update build script for automake/autoconf
- Activated FLAC reading suport in oggenc, using FLAC 1.1.1 (ICL compile)

QUOTE(nyaochi @ Nov 17 2004, 05:18 PM)
QUOTE(maacruz @ Nov 17 2004, 02:21 AM)
It does compile after some editing, but both enconding and playback are badly broken.
*

Could you give a better description of "badly broken"? Actually I didn't complie B04, but my own compile (ICL8.1) of B03 worked fine.
*

One thing I forget to mention. It is strongly recommended to use gcc 3.3. The patch does not work with gcc 3.4 and other versions.
*


Hi nyaochi

I have just tested B05 and it applied and compiled cleanly, but it has the same problem as B04.
It encodes, but the result is a big file which sounds like noise (using normal oggenc castanets2.ogg is 97247 bytes, using oggenc-sse it is 221705 bytes).
Playing normal ogg files doesn't work either; it sounds like noise too, and segfaults when reaching the end of the file. Vorbisgain also segfaults when reaching the end of a file.

I'm on a SuSE 9.1 Linux system, gcc 3.3.3, glibc 2.3.3, libogg 1.1.2, Athlon XP 2600 (Barton core).


This is the gdb output
(gdb) run castanets2.ogg
Starting program: /usr/bin/ogg123 castanets2.ogg
Reading symbols from /usr/lib/libvorbisfile.so.3...(no debugging symbols found)...done.
...
Dispositivo de sonido: Advanced Linux Sound Architecture (ALSA) output

[New Thread 1087495088 (LWP 27008)]
Reproduciendo: castanets2.ogg
Ogg Vorbis stream: 2 channel, 44100 Hz
Tiempo: 00:06,63 [00:00,00] de 00:06,63 ( 0,0 kbps) Búfer de Salida 0,0% (EOS (Fin de flujo))
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1077510816 (LWP 27005)]
0x402e51bd in _int_free () from /lib/tls/libc.so.6
(gdb) bt
#0 0x402e51bd in _int_free () from /lib/tls/libc.so.6
#1 0x402e55fb in free () from /lib/tls/libc.so.6
#2 0x400552ed in vorbis_comment_clear () from /usr/lib/libvorbis.so.0
#3 0x00000000 in ?? ()
#4 0x400386f0 in ?? () from /usr/lib/libvorbisfile.so.3
#5 0x08070590 in ?? ()
#6 0x00000001 in ?? ()
#7 0x40036a96 in ov_clear () from /usr/lib/libvorbisfile.so.3
...
nyaochi
QUOTE(maacruz @ Nov 18 2004, 02:59 AM)
Hi nyaochi

I have tested right now B05 and it applyed and compiled cleanly, but it does have the same problem than B04.
It encodes, but the result is a big file which sounds as noise (using normal oggenc castanets2.ogg is 97247 bytes, using oggenc-sse it is 221705 bytes).
Playing normal ogg files doesn't work either, it sounds as noise too, and segfaults when reaching the end of the file. Vorbisgain segfaults when reaching the end of a file.

I'm on a suse 9.1 linux system, gcc 3.3.3, glibc 2.3.3, libogg 1.1.2, athlon xp 2600 (Barthon core).
*

Thank you for the detailed information. I've just got an email from the author. He found a bug around the ov_read function that probably causes your crash. He also told me that he doesn't use the Makefile generated by the configure script; instead he uses the Makefile in Win32_MinGW, which is based on a project converted from MSVC, and compiles it with gcc version 3.3.1 (mingw special 20030804-1).

I suppose the Linux support of B05 is not adequate at present, so we have to investigate what causes the bitrate-bloat/noise problem. Although I have Fedora Core 1 with gcc 3.3.1, unfortunately I'm not familiar with Linux programming and have little time to debug it now. The author is aware of this problem, but can anyone solve it?
Sebastian Mares
QUOTE(nyaochi @ Nov 12 2004, 09:44 PM)
QUOTE(Sebastian Mares @ Nov 8 2004, 09:26 PM)
I see no speed gain when compared to the Pentium 4 builds from RareWares.
*

Weird...
*


In fact, the SSE/SSE2 optimized versions are slower by about 1x as seen here:

*
vearutop
does anyone have a binary of aoTuV b3 oggenc w/ the sse patch applied?
skamp
QUOTE(vearutop @ Dec 9 2004, 05:27 AM)
does anyone have binary aotuvb3 oggenc w/ sse patch applied?
*

I've uploaded linux binaries in this thread.
vearutop
thank you

do you have one for windows?
QuantumKnot
QUOTE(vearutop @ Dec 15 2004, 02:15 PM)
thank you

do you have one for windows?
*


Have a look at this page

http://homepage3.nifty.com/blacksword/
