Ogg Vorbis optimized for speed

Reply #125 – 2005-10-22 20:06:46

I recall reading vorbis was all x87 code, that is, floating point, but not accelerated.

What the ICC compiler does is a process called autovectorisation, it's a very clever piece of software that examines routines and attempts to implement them using the faster SSE(2) instructions. At least, that is how i understand it.

What lancer does is replace certain standard routines in vorbis with hand written SSE implementations. This is not assembly (i think), but it is vectorisation (making use of SSE) done by a human.

The SSE instruction set works at a lower precision than the regular x87 instructions, but i don't think that's ever reduced sound quality in a noticeable way.

I'm not an expert on this, but i hope this explanation is accurate enough to answer your questions.

There is also an accelerated vorbis decoder, look at the first post of this topic.

Quote

- W.Dee's wuvorbisfile (Japanese only?): wuvorbis.dll is a fast Ogg Vorbis decoder with SSE and 3DNow!, which is a part of KiriKiri software (useful for developing multi-media contents or adventure games). wuvorbis.dll decodes 1.4x-1.8x faster (SSE) and 1.5x-1.9x faster (3DNow!) than official libvorbis.

Babelfish translation
I'll see if I can get this to work and bench it.
EDIT: I can't make much sense of the Japanese even with Babelfish; the supplied .dll at least doesn't work as a regular vorbisfile.dll or vorbis.dll.

@toot - very impressive speedup, imagine if it were multithreading!

Reply #126 – 2005-11-18 15:37:28

Lancer 20051118 is out
http://homepage3.nifty.com/blacksword/

Reply #127 – 2005-11-18 15:48:48

Quote

I recall reading vorbis was all x87 code, that is, floating point, but not accelerated.

What the ICC compiler does is a process called autovectorisation, it's a very clever piece of software that examines routines and attempts to implement them using the faster SSE(2) instructions. At least, that is how i understand it.

What lancer does is replace certain standard routines in vorbis with hand written SSE implementations. This is not assembly (i think), but it is vectorisation (making use of SSE) done by a human.

The SSE instruction set works at a lower precision than the regular x87 instructions, but i don't think that's ever reduced sound quality in a noticeable way.

I'm not an expert on this, but i hope this explanation is accurate enough to answer your questions.

Basically nothing you said was correct.

1) autovectorisation is not the same as using SSE or SSE2 instructions
2) hand written SSE implementations are assembly or intrinsics
3) hand written (or automatically generated) SSE does *not* imply vectorisation
4) SSE or SSE2 does not automatically imply lower precision than floating point.

Reply #128 – 2005-11-18 16:24:59

Quote

Lancer 20051118 is out
http://homepage3.nifty.com/blacksword/

Fantastic! Thanks to all people involved!

Reply #129 – 2005-11-18 17:04:47

Here is a small Ogg Vorbis CLI encoder speed comparison between John33 and Lancer builds:

Code:

oggenc2.6-aoTuVb4.5generic.exe
Elapsed time: 0m 11.0s
Rate:         10.0169
Average bitrate: 151.0 kb/s

oggenc2.6-aoTuVb4.5P4.exe
Elapsed time: 0m 07.0s
Rate:         15.7409
Average bitrate: 151.0 kb/s

OggEnc_SSE_20041213ArcherB10.exe
Elapsed time: 0m 05.0s
Rate:         22.0373
Average bitrate: 148.3 kb/s
 
OggEnc_SSE_20050320ArcherRC4.exe
Elapsed time: 0m 05.0s
Rate:         22.0373
Average bitrate: 148.3 kb/s
 
oggenc2_lancer20050528_1.exe
Elapsed time: 0m 4.44s
Rate:         24.8335
Average bitrate: 141.0 kb/s

oggenc2_lancer20050621.exe
Elapsed time: 0m 4.30s
Rate:         25.6426
Average bitrate: 151.0 kb/s

oggenc2_lancer20050709.exe
Elapsed time: 0m 4.23s
Rate:         26.0242
Average bitrate: 151.0 kb/s

oggenc2_lancer20051118.exe
Elapsed time: 0m 4.27s
Rate:         25.8290
Average bitrate: 151.0 kb/s

Test environment:
Pentium 4 2.4 GHz, Windows XP SP2, 512 MB DDR266 SDRAM.
Test with an 18.5 MB, 44.1 kHz, stereo, 1 min 50 sec audio file and the -q4 switch.

NOTE: the results above might not be accurate...
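
To put the rates above in relative terms, here is a minimal Python sketch that divides each reported rate by the rate of the generic build. The numbers are copied from the log above; the dictionary and the `speedups` helper are names made up for this illustration only.

```python
# Encoding rates as reported in the log above (multiples of real time).
rates = {
    "oggenc2.6-aoTuVb4.5generic": 10.0169,
    "oggenc2.6-aoTuVb4.5P4": 15.7409,
    "OggEnc_SSE_20041213ArcherB10": 22.0373,
    "OggEnc_SSE_20050320ArcherRC4": 22.0373,
    "oggenc2_lancer20050528_1": 24.8335,
    "oggenc2_lancer20050621": 25.6426,
    "oggenc2_lancer20050709": 26.0242,
    "oggenc2_lancer20051118": 25.8290,
}


def speedups(rates, baseline):
    """Return each build's rate divided by the baseline build's rate."""
    base = rates[baseline]
    return {name: rate / base for name, rate in rates.items()}


if __name__ == "__main__":
    table = speedups(rates, "oggenc2.6-aoTuVb4.5generic")
    for name, factor in table.items():
        print(f"{name:32s} {factor:.2f}x")
```

On these figures the Lancer builds land at roughly two and a half times the generic build, with the usual caveat from the note above that a single run is not a precise measurement.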

Reply #130 – 2005-11-18 19:20:29

Quote

Quote
I recall reading vorbis was all x87 code, that is, floating point, but not accelerated.

What the ICC compiler does is a process called autovectorisation, it's a very clever piece of software that examines routines and attempts to implement them using the faster SSE(2) instructions. At least, that is how i understand it.

What lancer does is replace certain standard routines in vorbis with hand written SSE implementations. This is not assembly (i think), but it is vectorisation (making use of SSE) done by a human.

The SSE instruction set works at a lower precision than the regular x87 instructions, but i don't think that's ever reduced sound quality in a noticeable way.

I'm not an expert on this, but i hope this explanation is accurate enough to answer your questions.

Basically nothing you said was correct.

1) autovectorisation is not the same as using SSE or SSE2 instructions
2) hand written SSE implementations are assembly or intrinsics
3) hand written (or automatically generated) SSE does *not* imply vectorisation
4) SSE or SSE2 does not automatically imply lower precision than floating point.

To explain:

3DNow, SSE, SSE2 are alternate instruction sets for floating point processing. These instruction sets have some major advantages over the old x87 mode:

1) They have register based access, instead of stack based
2) They have the *possibility* to operate on 2 or 4 values at the same time (vectorisation)

SSE and 3DNow have 32 bit accuracy, SSE2 has 64 bit accuracy. x87 has 32 or 64 bit accuracy and a possibility (that shouldn't be used and I'm pretty sure vorbis doesn't use it!) to do 80 bit accuracy arithmetic.

These instruction sets can be used in the following ways: code for them manually (in assembler or with intrinsics), use a compiler that can emit SSE(2) instructions for floating point instead of x87, or use a compiler that can *vectorize* computations for SSE/SSE2.

Currently (besides manually writing in assembly), ICC is the best at vectorization, and some very recent GCCs can do it too. MSVC2005 and older GCCs can generate SSE(2) floating point instructions (without vectorisation).
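
The 32 bit versus 64 bit point can be seen without any SSE code at all. Below is a minimal Python sketch, not taken from libvorbis or Lancer, that adds the same numbers once with every intermediate result rounded to single precision and once in double precision. The helper `to_f32` emulates single-precision rounding with the `struct` module; all names here are assumptions for illustration.

```python
import struct


def to_f32(x):
    """Round a Python float (64 bit) to the nearest 32 bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Accumulate with every intermediate result rounded to 32 bit."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + to_f32(v))
    return acc


def sum_f64(values):
    """Accumulate in ordinary 64 bit floating point."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


samples = [0.1] * 100_000

if __name__ == "__main__":
    print("32 bit accumulation:", sum_f32(samples))
    print("64 bit accumulation:", sum_f64(samples))
```

Both results are close to 10000, but the single-precision one drifts visibly further from it. Whether a given encoder build ends up using 32 or 64 bit arithmetic depends on how the code was written and compiled, as described above.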

Reply #131 – 2005-11-19 04:44:19

Quote

Lancer 20051118 is out
http://homepage3.nifty.com/blacksword/

For some reason on the site right now the links are crossed out and removed.

EDIT: I see why now. aoTuV b4.51 came out.

Reply #132 – 2005-11-19 07:40:09

Quote

Lancer 20051118 is out
http://homepage3.nifty.com/blacksword/

really ?

Ogg Vorbis acceleration project: http://homepage3.nifty.com/blacksword/index_e.htm

Reply #133 – 2005-11-19 07:56:46

they must've pulled it, maybe since 4.51 bugfix was released almost simultaneously.

Reply #134 – 2005-11-19 09:48:09

Quote

they must've pulled it, maybe since 4.51 bugfix was released almost simultaneously.

It looks like it.. according to google's surprisingly legible translation..

November of 2005 19th

Release is discontinued to completion of the aoTu V beta4.51 base.

Reply #135 – 2005-11-19 13:34:02

Thanks for the explanation, Garf. Reading mostly about the use of 3DNow/SSE with regard to 3D work i didn't realise vectorisation was only a possibility. Or that x87 had precisions other than 80 bit.

Reply #136 – 2005-11-21 21:59:52

OK, lancer_20051121 patches against aotuv4.51 are out.

Can I patch the source under Linux?

What would be the exact command and what I need (besides aotuv source) ?

Reply #137 – 2005-11-25 01:09:21

I have downloaded Lancer_20051121 and tested the OggEnc2.exe. Here's the log of what I did (_j33 is John33's compile, _lancer is the Lancer version):

Code:

D:\Music\!Reprocess>oggenc_j33 -q 2 --output=Mamma_Mia_j33.ogg "ABBA - Mamma Mia
.wav"
Opening with wav module: WAV file reader
Encoding "ABBA - Mamma Mia.wav" to
         "Mamma_Mia_j33.ogg"
at quality 2.00
        [ 99.7%] [ 0m00s remaining] -

Done encoding file "Mamma_Mia_j33.ogg"

        File length:  3m 32.0s
        Elapsed time: 0m 16.0s
        Rate:         13.3078
        Average bitrate: 101.4 kb/s


D:\Music\!Reprocess>oggenc_lancer -q 2 --output=Mamma_Mia_lancer.ogg "ABBA - Mam
ma Mia.wav"
Opening with wav module: WAV file reader
Encoding "ABBA - Mamma Mia.wav" to
         "Mamma_Mia_lancer.ogg"
at quality 2.00
        [ 99.7%] [ 0m00s remaining] -

Done encoding file "Mamma_Mia_lancer.ogg"

        File length:  3m 32.00s
        Elapsed time: 0m 8.78s
        Rate:         24.2484
        Average bitrate: 101.4 kb/s

Wow! It's amazingly fast (I use my brother's AthlonXP 2400+). However, the next step I took made me pause:

Code:

D:\Music\!Reprocess>dir M*.ogg
 Volume in drive D is Data
 Volume Serial Number is 20E6-C9A1

 Directory of D:\Music\!Reprocess

2005-11-25  01:10         2,701,832 Mamma_Mia_j33.ogg
2005-11-25  01:10         2,701,784 Mamma_Mia_lancer.ogg
               2 File(s)      5,403,616 bytes
               0 Dir(s)   2,493,083,648 bytes free

Whoa! Significant difference? It can't be because of a different comment, can it? I checked with EditPlus, and I think the files are mostly identical. So I decoded both to WAVs and got the same size:

Code:

D:\Music\!Reprocess>dir *.wav
 Volume in drive D is Data
 Volume Serial Number is 20E6-C9A1

 Directory of D:\Music\!Reprocess

2005-11-15  02:42        37,560,040 ABBA - Mamma Mia.wav
2005-11-25  01:18        37,560,040 Mamma_Mia_j33.wav
2005-11-25  01:18        37,560,040 Mamma_Mia_lancer.wav
               4 File(s)    253,711,964 bytes
               0 Dir(s)   2,493,083,648 bytes free

Same as the original. Not knowing what else to do, I tried EAQUAL:

Code:

D:\Music\!Reprocess>eaqual -fref Mamma_Mia_j33.wav -ftest Mamma_Mia_lancer.wav

EAQUAL - Evaluation of Audio Quality
Version:        0.1.3alpha
Author:         Alexander Lerch, zplane.development
_______________________________________________________
Reference File:         Mamma_Mia_j33.wav
Test File:              Mamma_Mia_lancer.wav
Sample Rate:            44100
Number of Channels:     2

Press Escape to cancel...

Processed:              212.93 seconds of audio file
Time elapsed:   82.25

Resulting ODG:   0.11
Resulting DIX:   3.64

BandwidthRef    16082.5596
BandwidthTest   16082.5192
NMR             -34.2508
WinModDiff1     0.3679
ADB             -0.1596
EHS             0.0345
AvgModDiff1     0.1880
AvgModDiff2     0.3213
NoiseLoud       0.0132
MFPD            0.9995
RDF             0.0000

And it seems there are differences.

I tried listening to the results but to my ears they sound the same.

Can anyone shed light on why they differ?

EDIT: Changed CODE to CODEBOX, that's all.
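
Identical WAV sizes only show that both decodes hold the same number of samples, not that the samples are equal. A minimal Python sketch of a direct sample-by-sample comparison, assuming 16 bit PCM WAV files, could look like this. The function names are made up for the example, and the file names in the usage note are simply the ones from the listing above.

```python
import struct
import wave


def read_pcm16(path):
    """Return all samples of a 16 bit PCM WAV file as a list of ints."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("only 16 bit PCM is handled in this sketch")
        frames = wav.readframes(wav.getnframes())
    count = len(frames) // 2
    # WAV stores samples little-endian; channels stay interleaved.
    return list(struct.unpack("<%dh" % count, frames[: count * 2]))


def compare_wavs(path_a, path_b):
    """Return (number of differing samples, largest absolute difference)."""
    a = read_pcm16(path_a)
    b = read_pcm16(path_b)
    if len(a) != len(b):
        raise ValueError("files hold a different number of samples")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    differing = sum(1 for d in diffs if d)
    return differing, max(diffs, default=0)
```

Calling `compare_wavs("Mamma_Mia_j33.wav", "Mamma_Mia_lancer.wav")` would report how many samples differ and by how much at most. It says nothing about audibility, which is what a tool like EAQUAL tries to estimate.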

Reply #138 – 2005-11-25 01:34:37

Quote

Can anyone shed light on why they differ?

Read this page with machine translation if you really want to know the reason (the first item in Frequently Asked Questions)
http://homepage3.nifty.com/blacksword/readme_j.htm

In short, SSE arithmetic has 32bit precision while FPU (i.e., without SSE optimization/compile) arithmetic has 80bit precision. The computational error in floating point arithmetic may make the difference but is so small that you probably cannot hear the difference. I bet you also get a difference between John33's compile and reference binary distributed by Aoyumi.

Reply #139 – 2005-11-25 06:56:29

Quote

In short, SSE arithmetic has 32bit precision while FPU (i.e., without SSE optimization/compile) arithmetic has 80bit precision.

SSE2 has 64 bit accuracy, and the FPU is generally used with only 64 bit accuracy (using 80 bit mode is not possible in a portable way, and as I said, vorbis is not doing it).

Note that AMD64/EM64T use SSE/SSE2 exclusively instead of the FPU.

But yes, in this case the difference is likely just minor rounding error. Note that positive ODG means that there is no audible difference (actually: encoded sample is better than the original, but that's a limitation in the way EAQUAL works).
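
As an illustration of how "minor rounding error" can arise even at the same precision: code that processes four values at a time adds numbers in a different order than a plain loop, and floating point addition is not associative. The Python sketch below emulates this with single-precision rounding. It is an assumption-laden toy, not a description of what Lancer actually does; `to_f32`, `scalar_sum` and `four_lane_sum` are hypothetical names.

```python
import struct


def to_f32(x):
    """Round a Python float (64 bit) to the nearest 32 bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def scalar_sum(values):
    """Add the values one after another, rounding to 32 bit each step."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + v)
    return acc


def four_lane_sum(values):
    """Keep four partial sums, the way a 4-wide vector loop would,
    then combine the four lanes at the end."""
    lanes = [0.0, 0.0, 0.0, 0.0]
    for i, v in enumerate(values):
        lanes[i % 4] = to_f32(lanes[i % 4] + v)
    return to_f32(to_f32(lanes[0] + lanes[1]) + to_f32(lanes[2] + lanes[3]))


# Test data already rounded to 32 bit, so both functions see the same inputs.
values = [to_f32(1.0 / (i + 1)) for i in range(10_000)]

if __name__ == "__main__":
    print("scalar order:   ", scalar_sum(values))
    print("four-lane order:", four_lane_sum(values))
```

The two orders typically agree to several digits and disagree only in the last ones, which is the scale of difference being discussed here.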

Reply #140 – 2005-11-25 12:47:25

Whoa! Thanks for the clarification. I was afraid that the Lancer optimizations were buggy and would degrade the output, but this puts my fear to rest. I am amazed at the encoding speed increase and will change over to Lancer (oggenc, oggdrop, and libvorbis.dll).

One question: How do I decode the result of EAQUAL? Any pointer will be appreciated. Thanks a lot.

Reply #141 – 2005-11-25 13:48:03

Quote

Can I patch the source under the Linux ?

What would be the exact command and what I need (besides aotuv source) ?

Reply #142 – 2005-11-25 14:56:46

Quote

One question: How do I decode the result of EAQUAL? Any pointer will be appreciated. Thanks a lot.

ODG = Objective difference grade

From memory

Code:

 0 = Imperceptible
-1 = Perceptible but not annoying
-2 = Slightly annoying
-3 = Annoying
-4 = Very annoying

Positive value = better than perfect
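
For reading EAQUAL output mechanically, the scale above can be turned into a small lookup. This is a minimal Python sketch built only on the grades listed in this post (which were given from memory); snapping a fractional ODG to the nearest listed grade is an assumption made for the example, and `describe_odg` is a made-up name.

```python
# Grades as listed above: (ODG value, description).
ODG_SCALE = [
    (0.0, "Imperceptible"),
    (-1.0, "Perceptible but not annoying"),
    (-2.0, "Slightly annoying"),
    (-3.0, "Annoying"),
    (-4.0, "Very annoying"),
]


def describe_odg(odg):
    """Map an ODG value to the nearest description on the scale above."""
    if odg > 0:
        # Positive values are treated as "no audible difference".
        return "Imperceptible (positive value)"
    nearest = min(ODG_SCALE, key=lambda entry: abs(entry[0] - odg))
    return nearest[1]


if __name__ == "__main__":
    print(describe_odg(0.11))
```

With the ODG of 0.11 from the EAQUAL run earlier in the thread, this falls in the "imperceptible" bucket.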

Reply #143 – 2005-12-03 19:38:17

OK, I managed to patch the aotuv sources with the Lancer diff (I don't know what went wrong last time), but now oggenc segfaults. Switching back to the original aotuv helps.

Any comments ?

My box is amd64 Gentoo with gcc 3.4.4.

Reply #144 – 2005-12-23 11:50:57

Try compiling with gcc 3.3.x.

For me, gcc 3.4 can't compile the sources; 4.0 compiles them, but oggenc displays a mysterious error on start; 3.3 compiles them and oggenc works afterwards.

I'm using Ubuntu 5.10 with gcc 3.3.6, 3.4.4 and 4.0.1 (actually 4.0.2 pre) on an Athlon XP 2200+. libvorbis 1.1.2 compiled by gcc 4 with the default package options (--host=i486-linux-gnu) gives ~11x, with -march=athlon-xp -mfpmath=sse about 14x, and with the Lancer patches compiled by gcc 3.3 with -march=athlon-xp about 17x.

Reply #145 – 2006-01-01 08:00:41

With the late November Lancer oggenc2 , I get MediaCoder encoding speeds from Flacs around 29x on my AMD 3300+ Win X64 system (q 6.16).
I know this thread is about speed, but I wonder if others disagree with my perception that quality now is comparable to MPC at rates around 200 kbps.

Reply #146 – 2006-01-01 15:56:57

The quality at which -q setting...?

Reply #147 – 2006-01-01 20:43:29

Quote

The quality at which -q setting...?

Say, in the range of nominal 200 kbps, which is q 6.16 to 6.24.
I recall that MPC was considered near as dammit to transparent at q 8. So, I'm getting at whether there's a sweet spot in the latest Japanese tweaked Ogg Vorbis encoders in that 6 to 8 range.
I'm pretty sure that that glitch on the 6.0 boundary for the official release that made those just north sound much better has been solved...

Reply #148 – 2006-01-02 05:21:13

I know that I've ditched mpc "insane" for ogg q7. The poor seeking and limited hardware support for mpc and the improvements in Vorbis are what convinced me.

I don't think Guru has tested beyond the 170 to 180 range yet, which showed ogg to be on par with (and in some cases better than) mpc.

Reply #149 – 2006-01-02 08:33:29

Quote

I don't think Guru has tested beyond the 170 to 180 range yet, which showed ogg to be on par with (and in some cases better than) mpc.

No need to; it's a waste of time IMO. Most people, with the exception of a few like GuruB, can't tell the difference; I can't. If it were a low-bitrate test, then sure, why not.
