Full Version: Ogg Vorbis optimized for speed
Hydrogenaudio Forums > Lossy Audio Compression > Ogg Vorbis > Ogg Vorbis - Tech
HbG
QUOTE(rutra80 @ Jun 7 2005, 02:27 PM)
Lancer is probably packed with UPX or something and aoTuV is not.
*



Lancer's DLLs are pretty big too: 3 MB unpacked, 450 KB zipped. Beats me why they're that inflated.
VEG
Is the current Lancer (Archer) version a stable release? Why was the name of this tuning changed?
jorsol
The name changed from Archer to Lancer because it uses the aoTuV pre-beta 4, whereas the Archer versions use beta 3. That's why the name changed, or at least I suppose so. It is pretty stable, but it may still have various bugs to be fixed, and on the other hand it uses a pre-beta, which may have some problems of its own.
yong
A new version is out:
Lancer 20050621 (based on aotuv-b4_20050617) :)
sh1leshk4
Many thanks for the heads up. =)
Will test it out in a moment...
Biont
Almost a week has passed. Any test results?
Josef K.
QUOTE(Biont @ Jun 27 2005, 10:39 AM)
Almost a week has passed. Any test results?
*

I use it regularly without any problems (mostly q3 & q4), and the speed is fantastic. I've also noticed that it uses only about 90% CPU when running on my AMD 2700+ machine, whereas Archer used full power. :D
aspifox
QUOTE(Josef K. @ Jun 27 2005, 10:58 AM)
I use it regularly without any problems (mostly q3 & q4), and the speed is fantastic. I've also noticed that it uses only about 90% CPU when running on my AMD 2700+ machine, whereas Archer used full power. :D
*


Ooh. That probably implies that it's fast enough that it's actually sometimes blocking on IO. Impressive!
sh1leshk4
QUOTE(Biont @ Jun 27 2005, 03:39 PM)
Almost a week has passed. Any test results?
*


Well, compared to the Lancer that uses aoTuV pre-beta 4, there's no really noticeable difference.
You might want to check a few posts (or pages?) back for the previous Lancer's performance.
Compared to Archer, it's just like Josef said.

Anyway, the speed gain (over the aoTuVb4 compiles found at Rarewares.org) on slower PIII systems is adequate.
(I tested it on a PIII 600MHz.)
No numbers yet (since I kinda forgot...), but I think it was about 1.15x to 1.30x faster.
As always, correct me if I'm wrong. =)
HbG
From http://www.tom.womack.net/x86FAQ/faq_features.html

QUOTE
For the P3, Intel skimped somewhat on the implementation, using only a two-wide ALU, so the average performance of SSE and 3DNow will be the same - I've constructed sequences of instructions which are faster on 3DNow. It's possible they'll use a four-wide one on later chips, which would make SSE roughly twice as fast as 3DNow.


This is also why P3's are notoriously poor on certain games such as UT2003/4.
Tropican
New Lancer build based on the aoTuVb4 library merged with libvorbis 1.1.1. Previous was based on aoTuVb4 with libvorbis 1.1.0

http://homepage3.nifty.com/blacksword/index_e.htm as always
wjdashwood
Just tried the latest version in place of the encoder built into Foobar, and it's well over twice as fast, especially when converting from FLAC. Amazing! :D
judfilm
Just finished a test with besweet. Encoding time dropped from 1:54 to 1:13 (mins:secs).
judfilm
FYI - New 'Lancer' builds of oggdropXPd v1.8.6 and libvorbis.dll

http://homepage3.nifty.com/blacksword/index_e.htm as always
de Mon
I have a question. Will Lancer and P-III optimized (from rarewares.org) versions work and have any gain on Celeron 128kb cache (not Tualatin)?
jorsol
QUOTE(de Mon @ Sep 4 2005, 02:14 PM)
I have a question. Will Lancer and P-III optimized (from rarewares.org) versions work and have any gain on Celeron 128kb cache (not Tualatin)?
*


Only if your Celeron supports SSE. Try a program like CPU-Z to see whether it has the SSE instructions.
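For readers on Linux, the same check can be sketched without CPU-Z by looking for the `sse` flag in `/proc/cpuinfo`. This is only an illustration: the function name `has_cpu_flag` is made up here, the approach assumes an x86 Linux system that exposes a `flags` line, and it does not apply to the Windows machines discussed in this thread.

```python
def has_cpu_flag(cpuinfo_text: str, flag: str) -> bool:
    """Return True if `flag` is listed on any 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only the "flags : ..." lines carry the instruction-set names.
        if sep and key.strip() == "flags" and flag in value.split():
            return True
    return False


# Small made-up sample in the /proc/cpuinfo layout (hypothetical CPU).
sample = "processor\t: 0\nflags\t\t: fpu vme mmx sse\n"
print("sse: ", has_cpu_flag(sample, "sse"))
print("sse2:", has_cpu_flag(sample, "sse2"))
```

On a real system the text would come from `open("/proc/cpuinfo").read()` instead of the sample string.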
zver
I've got a question, guys.
I encoded about 25 different songs with foobar 0.8.3, once using the built-in Vorbis, which is 1.1, and once using Lancer, which is libvorbis 1.1.1 merged with aoTuV b4. With Lancer the file size increased on all samples by 5-10%; actually 20 samples were 10% larger and the rest were between 5% and 10%. These were classic rock songs, encoded on a P4 with XP SP2.
It is quite a bit faster, which is nice, but what confuses me is that at the beginning of the thread all the tests show the same bitrate. I encoded at q5, and everything else was at foobar's defaults. :D
sh1leshk4
By 'built-in', did u mean the official vorbis libraries?
The difference is probably caused by the different tunings used.
The official one doesn't use AoTuV b4 tunings yet.
(is it still b2 or something...? I forgot...)
zver
QUOTE(sh1leshk4 @ Sep 4 2005, 07:01 PM)
By 'built-in', did u mean the official vorbis libraries?
The difference is probably caused by the different tunings used.
The official one doesn't use AoTuV b4 tunings yet.
(is it still b2 or something...? I forgot...)
*


I meant the one that comes by default in foobar, already configured in the diskwriter.
mrq and foobar report it as 1.1.
Both were encoded with default preferences: q5 and no other parameters.
HbG
The bitrate difference you're seeing is aoTuV vs. 1.1. Lancer may change bitrates compared to regular aoTuV, but only by a tiny bit.
sh1leshk4
QUOTE(zver @ Sep 6 2005, 06:09 AM)
I meant the one that comes by default in foobar, already configured in the diskwriter.
mrq and foobar report it as 1.1.
Both were encoded with default preferences: q5 and no other parameters.
*


Exactly.
I don't think that version (1.1.0) uses the aoTuV b4 tunings yet.
So if the resulting file size difference is quite big, it's probably because of the different tunings used.
de Mon
QUOTE(jorsol @ Sep 4 2005, 04:16 PM)
QUOTE(de Mon @ Sep 4 2005, 02:14 PM)
I have a question. Will Lancer and P-III optimized (from rarewares.org) versions work and have any gain on Celeron 128kb cache (not Tualatin)?
*


Only if your Celeron supports SSE. Try a program like CPU-Z to see whether it has the SSE instructions.
*



Yes, CPU-Z says my CPU has SSE. I tried 'Lancer oggenc' and it is really faster than the P-III compile. :)
However, I also tried 'Lancer OggDropXPd' and it doesn't work. :(
When I drop WAVs on it, nothing happens. Does anybody know why?
My PC is an Intel Celeron 1000 MHz (not Tualatin) with Windows 98SE. If anybody has Windows 98SE installed, please check: does 'Lancer OggDropXPd' work?
Thanks.
PatchWorKs
QUOTE
My PC is Intel Celeron 1000 MHz (not Tualatin) with Windows 98SE.

Try these updates...
toot
Nice speed increase here on an AMD X2 4400+ :)

aoTuVb4 - no enhancements (oggenc)
File length: 72m 28.0s
Elapsed time: 5m 07.0s
Rate: 14.1643
Average bitrate: 192.2 kb/s

aoTuVb4 - SSE version (oggenc2)
File length: 72m 28.0s
Elapsed time: 4m 16.0s
Rate: 16.9861
Average bitrate: 192.2 kb/s

aoTuVb4 - SSE2 version (oggenc2)
File length: 72m 28.0s
Elapsed time: 3m 30.0s
Rate: 20.7068
Average bitrate: 192.2 kb/s

lancer20050709 (oggenc2)
File length: 72m 28.00s
Elapsed time: 2m 3.66s
Rate: 35.1655
Average bitrate: 192.2 kb/s
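
To put the four runs above on one scale, here is a minimal sketch that divides the baseline elapsed time by each build's elapsed time. The elapsed times are copied from the post; the dictionary labels and the helper name `speedups` are just shorthand chosen for this illustration.

```python
# Elapsed times in seconds, copied from the encoder output above.
elapsed = {
    "aoTuVb4 plain": 5 * 60 + 7.0,
    "aoTuVb4 SSE build": 4 * 60 + 16.0,
    "aoTuVb4 SSE2 build": 3 * 60 + 30.0,
    "lancer20050709": 2 * 60 + 3.66,
}


def speedups(times, baseline):
    """Return {name: baseline_time / time} for every entry in `times`."""
    base = times[baseline]
    return {name: base / t for name, t in times.items()}


result = speedups(elapsed, "aoTuVb4 plain")
for name, factor in result.items():
    print(f"{name:20s} {factor:.2f}x")
```

By this measure the Lancer build comes out at roughly 2.5 times the speed of the plain aoTuVb4 compile on this machine, at the same reported average bitrate.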
de Mon
Is this optimization done only by using SSE and SSE2, or also by rewriting parts of the code in assembly?
Can the same optimization work be done for the Ogg Vorbis decoder?
HbG
I recall reading vorbis was all x87 code, that is, floating point, but not accelerated.

What the ICC compiler does is a process called autovectorisation, it's a very clever piece of software that examines routines and attempts to implement them using the faster SSE(2) instructions. At least, that is how i understand it.

What lancer does is replace certain standard routines in vorbis with hand written SSE implementations. This is not assembly (i think), but it is vectorisation (making use of SSE) done by a human.

The SSE instruction set works at a lower precision than the regular x87 instructions, but i don't think that's ever reduced sound quality in a noticeable way.

I'm not an expert on this, but i hope this explanation is accurate enough to answer your questions.

There is also an accelerated vorbis decoder, look at the first post of this topic.

QUOTE
- W.Dee's wuvorbisfile (Japanese only?): wuvorbis.dll is a fast Ogg Vorbis decoder with SSE and 3DNow!, which is a part of KiriKiri software (useful for developing multi-media contents or adventure games). wuvorbis.dll decodes 1.4x-1.8x faster (SSE) and 1.5x-1.9x faster (3DNow!) than official libvorbis.
Babelfish translation
I'll see if i can get this to work and bench it.
EDIT: can't make much sense of the japanese even with babelfish, the .dll supplied at least doesn't work as a regular vorbisfile.dll or vorbis.dll.

@toot - very impressive speedup, imagine if it were multithreading!
yong
Lancer 20051118 is out :)
http://homepage3.nifty.com/blacksword/
Garf
QUOTE(HbG @ Oct 22 2005, 09:06 PM)
I recall reading vorbis was all x87 code, that is, floating point, but not accelerated.

What the ICC compiler does is a process called autovectorisation, it's a very clever piece of software that examines routines and attempts to implement them using the faster SSE(2) instructions. At least, that is how i understand it.

What lancer does is replace certain standard routines in vorbis with hand written SSE implementations. This is not assembly (i think), but it is vectorisation (making use of SSE) done by a human.

The SSE instruction set works at a lower precision than the regular x87 instructions, but i don't think that's ever reduced sound quality in a noticeable way.

I'm not an expert on this, but i hope this explanation is accurate enough to answer your questions.
*



Basically nothing you said was correct.

1) autovectorisation is not the same as using SSE or SSE2 instructions
2) hand written SSE implementations are assembly or intrinsics
3) hand written (or automatically generated) SSE does *not* imply vectorisation
4) SSE or SSE2 does not automatically imply lower precision than floating point.
Gecko
QUOTE(yong @ Nov 18 2005, 05:37 PM)

Fantastic! Thanks to all people involved!
yong
Here is a small Ogg Vorbis CLI encoder speed comparison between John33 and Lancer builds:
CODE

oggenc2.6-aoTuVb4.5generic.exe
Elapsed time: 0m 11.0s
Rate: 10.0169
Average bitrate: 151.0 kb/s

oggenc2.6-aoTuVb4.5P4.exe
Elapsed time: 0m 07.0s
Rate: 15.7409
Average bitrate: 151.0 kb/s

OggEnc_SSE_20041213ArcherB10.exe
Elapsed time: 0m 05.0s
Rate: 22.0373
Average bitrate: 148.3 kb/s

OggEnc_SSE_20050320ArcherRC4.exe
Elapsed time: 0m 05.0s
Rate: 22.0373
Average bitrate: 148.3 kb/s

oggenc2_lancer20050528_1.exe
Elapsed time: 0m 4.44s
Rate: 24.8335
Average bitrate: 141.0 kb/s

oggenc2_lancer20050621.exe
Elapsed time: 0m 4.30s
Rate: 25.6426
Average bitrate: 151.0 kb/s

oggenc2_lancer20050709.exe
Elapsed time: 0m 4.23s
Rate: 26.0242
Average bitrate: 151.0 kb/s

oggenc2_lancer20051118.exe
Elapsed time: 0m 4.27s
Rate: 25.8290
Average bitrate: 151.0 kb/s


Test environment:
Pentium4 2.4GHZ, Windows XP SP2, 512MB ddr266 sdram.
Test with 18.5 MB, 44.1khz, stereo, 1min 50sec audio file, and -q4 switch.

NOTE: the results above might not be accurate...
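
Several of the elapsed times above are reported only to the whole second, and a single run can be noisy. One way to get steadier numbers is to repeat the measurement and take the median. The sketch below is an assumption-laden illustration rather than the method used for the table: `time_call` and `realtime_rate` are hypothetical helper names, and the timed workload is a stand-in, not an encoder.

```python
import statistics
import time


def time_call(fn, runs=5):
    """Call fn() `runs` times and return the wall-clock duration of each call in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def realtime_rate(audio_seconds, elapsed_seconds):
    """Encoding speed as a multiple of real time."""
    return audio_seconds / elapsed_seconds


# Stand-in workload so the sketch runs anywhere; in practice fn would launch the encoder.
samples = time_call(lambda: sum(range(100_000)), runs=3)
print(f"median {statistics.median(samples):.4f}s over {len(samples)} runs")

# Nominal figures from the post: a 1min 50sec file encoded in 4.27 s.
print(f"{realtime_rate(110.0, 4.27):.1f}x real time")
```

To time a real encoder, `fn` could wrap a `subprocess.run([...])` call with the command line being tested. The rate computed from the nominal 110 s length is slightly lower than the 25.8290 the encoder printed, presumably because the file is a little longer than exactly 1min 50sec.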
Garf
QUOTE(Garf @ Nov 18 2005, 05:48 PM)
QUOTE(HbG @ Oct 22 2005, 09:06 PM)
I recall reading vorbis was all x87 code, that is, floating point, but not accelerated.

What the ICC compiler does is a process called autovectorisation, it's a very clever piece of software that examines routines and attempts to implement them using the faster SSE(2) instructions. At least, that is how i understand it.

What lancer does is replace certain standard routines in vorbis with hand written SSE implementations. This is not assembly (i think), but it is vectorisation (making use of SSE) done by a human.

The SSE instruction set works at a lower precision than the regular x87 instructions, but i don't think that's ever reduced sound quality in a noticeable way.

I'm not an expert on this, but i hope this explanation is accurate enough to answer your questions.
*



Basically nothing you said was correct.

1) autovectorisation is not the same as using SSE or SSE2 instructions
2) hand written SSE implementations are assembly or intrinsics
3) hand written (or automatically generated) SSE does *not* imply vectorisation
4) SSE or SSE2 does not automatically imply lower precision than floating point.
*



To explain:

3DNow, SSE, SSE2 are alternate instruction sets for floating point processing. These instruction sets have some major advantages over the old x87 mode:

1) They have register based access, instead of stack based
2) They have the *possibility* to operate on 2 or 4 values at the same time (vectorisation)

SSE and 3DNow have 32 bit accuracy, SSE2 has 64 bit accuracy. x87 has 32 or 64 bit accuracy and a possibility (that shouldn't be used and I'm pretty sure vorbis doesn't use it!) to do 80 bit accuracy arithmetic.

Using these instruction sets can be done in the following ways: code for them manually (in assembler or with intrinsics), use a compiler that can use the SSE(2) instructions for floating point instead of x87, or use a compiler that can *vectorize* computations for SSE/SSE2.

Currently (besides manually writing in assembly), ICC is the best at vectorization, and some very recent GCCs have the possibility too. MSVC2005 and older GCCs have the possibility to generate SSE(2) floating point instructions (without vectorisation).
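
As an illustration of the accuracy point above, the following sketch adds the same numbers once in 64-bit arithmetic and once with every intermediate result rounded to 32 bits. It only demonstrates how rounding width changes a result slightly; it is not how Vorbis or Lancer computes anything, and the helper names are made up for this example.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sum_float64(values):
    """Plain accumulation in 64-bit floating point."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_float32(values):
    """Accumulate with the input and the running total rounded to 32 bits at every step."""
    total = to_float32(0.0)
    for v in values:
        total = to_float32(total + to_float32(v))
    return total


values = [0.1] * 1000
wide = sum_float64(values)
narrow = sum_float32(values)
print(wide, narrow, abs(wide - narrow))
```

The two sums differ only far behind the decimal point, which is the kind of minor rounding difference one would expect between builds that use different floating point widths.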
DreamTactix291
QUOTE(yong @ Nov 18 2005, 09:37 AM)
For some reason on the site right now the links are crossed out and removed.

EDIT: I see why now. aoTuV b4.51 came out.
PatchWorKs
QUOTE(yong @ Nov 18 2005, 04:37 PM)

Really?

Ogg Vorbis acceleration project
vinnie97
they must've pulled it, maybe since 4.51 bugfix was released almost simultaneously.
toot
QUOTE(vinnie97 @ Nov 19 2005, 08:56 AM)
they must've pulled it, maybe since 4.51 bugfix was released almost simultaneously.
*



It looks like it.. according to google's surprisingly legible translation..


November of 2005 19th

Release is discontinued to completion of the aoTu V beta4.51 base.
HbG
Thanks for the explanation, Garf. Reading mostly about the use of 3DNow/SSE with regard to 3D work i didn't realise vectorisation was only a possibility. Or that x87 had precisions other than 80 bit.
suur13
OK, lancer_20051121 patches against aotuv4.51 are out.

Can I patch the source under Linux?

What would be the exact command, and what do I need (besides the aoTuV source)?
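
A hedged sketch of what such a command could look like with GNU patch: the diff file name `lancer.diff` and the strip level `-p1` are assumptions for illustration, not the actual names or layout of the Lancer download, and the command would be run from inside the unpacked aoTuV source directory. The helper only builds the argument list so that the dry-run form can be inspected first.

```python
import shlex


def patch_command(diff_file, strip=1, dry_run=True):
    """Build a GNU patch argument list; dry_run only reports what would change."""
    argv = ["patch", f"-p{strip}", "-i", diff_file]
    if dry_run:
        argv.append("--dry-run")
    return argv


# Hypothetical file name; check the real archive for the actual diff and its path depth.
cmd = patch_command("lancer.diff")
print(shlex.join(cmd))
```

If the dry run reports that it cannot find the files to patch, the strip level is the usual thing to adjust (for example `-p0` instead of `-p1`).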
pepoluan
I have downloaded Lancer_20051121 and tested the OggEnc2.exe. Here's the log of what I have done (_j33 is John33's compile, _lancer is the Lancer version):

CODE

D:\Music\!Reprocess>oggenc_j33 -q 2 --output=Mamma_Mia_j33.ogg "ABBA - Mamma Mia
.wav"
Opening with wav module: WAV file reader
Encoding "ABBA - Mamma Mia.wav" to
        "Mamma_Mia_j33.ogg"
at quality 2.00
       [ 99.7%] [ 0m00s remaining] -

Done encoding file "Mamma_Mia_j33.ogg"

       File length:  3m 32.0s
       Elapsed time: 0m 16.0s
       Rate:         13.3078
       Average bitrate: 101.4 kb/s


D:\Music\!Reprocess>oggenc_lancer -q 2 --output=Mamma_Mia_lancer.ogg "ABBA - Mam
ma Mia.wav"
Opening with wav module: WAV file reader
Encoding "ABBA - Mamma Mia.wav" to
        "Mamma_Mia_lancer.ogg"
at quality 2.00
       [ 99.7%] [ 0m00s remaining] -

Done encoding file "Mamma_Mia_lancer.ogg"

       File length:  3m 32.00s
       Elapsed time: 0m 8.78s
       Rate:         24.2484
       Average bitrate: 101.4 kb/s


Wow! It's amazingly fast (I use my brother's AthlonXP 2400+). However, the next step I took made me pause:

CODE

D:\Music\!Reprocess>dir M*.ogg
Volume in drive D is Data
Volume Serial Number is 20E6-C9A1

Directory of D:\Music\!Reprocess

2005-11-25  01:10         2,701,832 Mamma_Mia_j33.ogg
2005-11-25  01:10         2,701,784 Mamma_Mia_lancer.ogg
              2 File(s)      5,403,616 bytes
              0 Dir(s)   2,493,083,648 bytes free


Whoa! The file sizes differ (by 48 bytes). That can't just be because of a different comment, can it? I checked with EditPlus, and I think the files are mostly identical. So I decoded both to WAV and got the same size:

CODE

D:\Music\!Reprocess>dir *.wav
Volume in drive D is Data
Volume Serial Number is 20E6-C9A1

Directory of D:\Music\!Reprocess

2005-11-15  02:42        37,560,040 ABBA - Mamma Mia.wav
2005-11-25  01:18        37,560,040 Mamma_Mia_j33.wav
2005-11-25  01:18        37,560,040 Mamma_Mia_lancer.wav
              4 File(s)    253,711,964 bytes
              0 Dir(s)   2,493,083,648 bytes free


Same as the original. Not knowing what else to do, I tried EAQUAL:

CODE

D:\Music\!Reprocess>eaqual -fref Mamma_Mia_j33.wav -ftest Mamma_Mia_lancer.wav

EAQUAL - Evaluation of Audio Quality
Version:        0.1.3alpha
Author:         Alexander Lerch, zplane.development
_______________________________________________________
Reference File:         Mamma_Mia_j33.wav
Test File:              Mamma_Mia_lancer.wav
Sample Rate:            44100
Number of Channels:     2

Press Escape to cancel...

Processed:              212.93 seconds of audio file
Time elapsed:   82.25

Resulting ODG:   0.11
Resulting DIX:   3.64

BandwidthRef    16082.5596
BandwidthTest   16082.5192
NMR             -34.2508
WinModDiff1     0.3679
ADB             -0.1596
EHS             0.0345
AvgModDiff1     0.1880
AvgModDiff2     0.3213
NoiseLoud       0.0132
MFPD            0.9995
RDF             0.0000


And it seems there are differences.

I tried listening to the results but to my ears they sound the same.

Can anyone shed light on why they differ?

EDIT: Changed CODE to CODEBOX that's all
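
For anyone wanting to repeat the "are the files mostly identical" check without a text editor, here is a minimal sketch. The helper names are made up for this example; the two sizes are taken from the directory listing above.

```python
def compare_bytes(a: bytes, b: bytes):
    """Return (length difference b - a, number of differing positions in the common prefix)."""
    common = min(len(a), len(b))
    differing = sum(1 for i in range(common) if a[i] != b[i])
    return len(b) - len(a), differing


def compare_files(path_a, path_b):
    """Read both files fully and compare them position by position."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return compare_bytes(fa.read(), fb.read())


# Sizes of the two .ogg files from the directory listing.
size_j33, size_lancer = 2_701_832, 2_701_784
gap = size_j33 - size_lancer
print(f"{gap} bytes apart, {100 * gap / size_j33:.4f}% of the file")
```

A positional comparison like this overstates the difference once one stream becomes shorter than the other, because everything after that point is shifted, so the count is only a rough indicator.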
nyaochi
QUOTE(pepoluan @ Nov 25 2005, 10:09 AM)
Can anyone shed light on why they differ?

Read this page with machine translation if you really want to know the reason (the first item in Frequently Asked Questions)
http://homepage3.nifty.com/blacksword/readme_j.htm

In short, SSE arithmetic has 32bit precision while FPU (i.e., without SSE optimization/compile) arithmetic has 80bit precision. The computational error in floating point arithmetic may make the difference but is so small that you probably cannot hear the difference. I bet you also get a difference between John33's compile and reference binary distributed by Aoyumi.
Garf
QUOTE(nyaochi @ Nov 25 2005, 03:34 AM)
In short, SSE arithmetic has 32bit precision while FPU (i.e., without SSE optimization/compile) arithmetic has 80bit precision.
*



SSE2 has 64 bit accuracy, and the FPU is generally used with only 64 bit accuracy (using 80 bit mode is not possible in a portable way, and as I said, vorbis is not doing it).

Note that AMD64/EM64T use SSE/SSE2 exclusively instead of the FPU.

But yes, in this case the difference is likely just minor rounding error. Note that positive ODG means that there is no audible difference (actually: encoded sample is better than the original, but that's a limitation in the way EAQUAL works).
pepoluan
Whoa! Thanks for the clarification. :) I was afraid that the Lancer optimizations were buggy and would degrade the output, but this puts my fear to rest. I am amazed at the encoding speed increase and will change over to Lancer (oggenc, oggdrop, and libvorbis.dll).

One question: how do I interpret the results of EAQUAL? Any pointers will be appreciated. Thanks a lot.
suur13
QUOTE(suur13 @ Nov 22 2005, 12:59 AM)
Can I patch the source under Linux?

What would be the exact command, and what do I need (besides the aoTuV source)?
*


:(
Garf
QUOTE(pepoluan @ Nov 25 2005, 02:47 PM)
One question: how do I interpret the results of EAQUAL? Any pointers will be appreciated. Thanks a lot.
*



ODG = Objective difference grade

From memory

CODE

0 = Imperceptible
-1 = Perceptible but not annoying
-2 = Slightly annoying
-3 = Annoying
-4 = Very annoying


Positive value = better than perfect ;)
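
The scale above can be turned into a small lookup. This sketch uses the five labels exactly as listed (which were given from memory); snapping a fractional ODG to the nearest listed grade and treating positive values as 0 are assumptions made for this illustration, and `odg_label` is a made-up helper name.

```python
# Anchor points of the scale as listed above.
ODG_LABELS = [
    (0.0, "Imperceptible"),
    (-1.0, "Perceptible but not annoying"),
    (-2.0, "Slightly annoying"),
    (-3.0, "Annoying"),
    (-4.0, "Very annoying"),
]


def odg_label(odg: float) -> str:
    """Clamp odg into [-4, 0] and return the label of the nearest anchor point."""
    clamped = max(-4.0, min(0.0, odg))
    nearest = min(ODG_LABELS, key=lambda pair: abs(pair[0] - clamped))
    return nearest[1]


print(odg_label(0.11))   # the ODG from the EAQUAL run quoted earlier
print(odg_label(-2.4))
```

With this mapping, the 0.11 reported for the Lancer file falls under "Imperceptible".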
suur13
OK, I managed to patch the aoTuV sources with the Lancer diff (don't know what went wrong last time), but now oggenc segfaults. Switching back to the original aoTuV helps.

Any comments ?

My box is amd64 Gentoo with gcc 3.4.4.
iGold
Try compiling with gcc 3.3.x.

For me, gcc 3.4 can't compile the sources, and 4.0 compiles them but oggenc displays a mysterious error on start. With 3.3 it compiles and oggenc works.

I'm using Ubuntu 5.10 with gcc 3.3.6, 3.4.4 and 4.0.1 (actually 4.0.2 pre) on an Athlon XP 2200+. libvorbis 1.1.2 compiled by gcc 4 with the default package options (--host=i486-linux-gnu) gives ~11x, with -march=athlon-xp -mfpmath=sse about 14x, and with the Lancer patches compiled by gcc 3.3 with -march=athlon-xp about 17x.
ckjnigel
With the late November Lancer oggenc2 , I get MediaCoder encoding speeds from Flacs around 29x on my AMD 3300+ Win X64 system (q 6.16).
I know this thread is about speed, but I wonder if others disagree with my perception that quality now is comparable to MPC at rates around 200 kbps.
sh1leshk4
The quality at which -q setting...?
ckjnigel
QUOTE(sh1leshk4 @ Jan 1 2006, 10:56 AM)
The quality at which -q setting...?
*


Say, in the range of nominal 200 kbps, which is q 6.16 to 6.24.
I recall that MPC was considered near as dammit to transparent at q 8. So, I'm getting at whether there's a sweet spot in the latest Japanese tweaked Ogg Vorbis encoders in that 6 to 8 range.
I'm pretty sure that that glitch on the 6.0 boundary for the official release that made those just north sound much better has been solved...
vinnie97
I know that I've ditched mpc "insane" for ogg q7. The poor seeking and limited hardware support for mpc and the improvements in Vorbis are what convinced me.

I don't think Guru has tested beyond the 170 to 180 range yet, which showed ogg to be on par with (and in some cases better than) mpc.
HotshotGG
QUOTE
I don't think Guru has tested beyond the 170 to 180 range yet, which showed ogg to be on par with (and in some cases better than) mpc.


No need to; it's a waste of time IMO. Most people, with the exception of a few like GuruB, can't tell the difference. I can't. If it were a low-bitrate test, then sure, why not. :D