Testing multi-core optimized encoders for next BonkEnc release
enzo
post Jun 29 2009, 20:31
Post #1


BonkEnc developer


Group: Developer
Posts: 70
Joined: 17-January 03
From: Hamburg
Member No.: 4611



Hi, I am currently testing encoders built and optimized using the Intel Compiler 11.1. I plan to include them in BonkEnc 1.0.13.

ICL 11 supports automatic parallelization and thus allows encoders like LAME or Vorbis to make use of multi-core processors.

Note that this approach to multi-core support does not introduce any additional quality loss into the encoding process. Other approaches encode multiple frames in parallel, but have to disable certain encoder features in exchange (e.g. LAME-MT, which disables the bit reservoir). However, automatic parallelization does not scale as well as those other methods. Not all parts of the algorithms can be parallelized, so we will not see four times faster encoding on a quad core with this one.
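This scaling limit is just Amdahl's law: if only a fraction p of the encoder's work can be parallelized, n cores give a speedup of 1/((1-p)+p/n). A small sketch (hypothetical helper names) that also inverts the formula to estimate p from a measured speedup:

```python
def amdahl_speedup(p, n):
    """Overall speedup when a fraction p of the work runs on n cores."""
    return 1.0 / ((1.0 - p) + p / n)

def parallel_fraction(speedup, n):
    """Invert Amdahl's law: estimate p from a speedup observed on n cores."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)
```

For example, the Vorbis times below (7:27 single-core ICL vs. 2:39 parallel on a quad core) work out to roughly a 2.8x speedup, which by this formula suggests that about 86% of the encoder's runtime was parallelized.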

On my desktop system (Windows XP Pro x64, Phenom II X4 920) I still measured great performance improvements when ripping and converting. On a laptop (Windows Vista Home Premium, Athlon64 X2 QL-64) the improvement was not as impressive but still significant. Curiously, aoTuV was extremely slow on the Phenom and twice as fast on the actually inferior laptop. Here are the numbers (encoding time for a 44 min CD):
CODE
System: Windows XP Pro x64, AMD Phenom II X4 920

Input   Output   GCC    ICL    ICL (single core)
------------------------------------------------
CD      LAME     5:45   3:27   4:01
CD      Vorbis   9:26   2:39   7:27
CD      FLAC     2:25   2:22

WAV     FLAC     1:18   0:41

CODE
System: Windows Vista Home Premium, AMD Athlon64 X2 QL-64

Input   Output   GCC    ICL
---------------------------
CD      LAME     5:33   5:26
CD      Vorbis   4:48   3:50

WAV     FLAC     1:30   1:21
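For comparison purposes, the mm:ss values in these tables can be turned into speedup factors with a few lines (hypothetical helpers, operating on the numbers posted above):

```python
def to_seconds(t):
    """Convert an "m:ss" time string to seconds, e.g. "5:45" -> 345."""
    minutes, seconds = t.split(":")
    return int(minutes) * 60 + int(seconds)

def speedup(before, after):
    """How many times faster the second time is compared to the first."""
    return to_seconds(before) / to_seconds(after)

# From the tables above: Vorbis on the Phenom, and Vorbis on the laptop.
print(round(speedup("9:26", "2:39"), 2))
print(round(speedup("4:48", "3:50"), 2))
```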

I am using the following compiler options to build the encoders: /O3 /Qparallel /Qipo /Qprof-use /arch:IA32 /QaxSSE4.1,SSSE3,SSE3,SSE2

You can find a preview release of BonkEnc 1.0.13 at bonkenc.org/updates/bonkenc-1.0.13-pre.zip. It includes ICL 11.1 compiles of LAME 3.98.3, FLAC 1.2.1, Vorbis aoTuV b5.7, FAAC 1.28 and Bonk 0.12.

It would be great to have some speed comparisons for other systems. So if you could compare BonkEnc 1.0.12 vs. this prerelease and post the numbers here, I would be very grateful.

What do you think about this idea in general? Are the compiler options safe or should I exclude any specific optimizations? Any other things you would like to mention?

Looking forward to reading your suggestions and opinions on this.

Robert
saratoga
post Jun 29 2009, 21:50
Post #2





Group: Members
Posts: 4718
Joined: 2-September 02
Member No.: 3264



How much do these settings change the output of the encoders? Do you get the same speed up when the source is WAV?
Fandango
post Jun 30 2009, 00:16
Post #3





Group: Members
Posts: 1546
Joined: 13-August 03
Member No.: 8353



QUOTE (enzo @ Jun 29 2009, 21:31) *
It would be great to have some speed comparisons for other systems. So if you could compare BonkEnc 1.0.12 vs. this prerelease and post the numbers here, I would be very grateful.
...is not so good when everyone is testing with a different set of audio files, though.

Also, there's a dll missing: libiomp5md.dll.

General question: aren't there any license-free music tracks out there that we could use and redistribute for testing purposes? Nine Inch Nails comes to mind... but that's just one genre, unless the many remixes are license-free too. Of course, they should be checked for lossy compression artefacts first.

QUOTE (enzo @ Jun 29 2009, 21:31) *
What do you think about this idea in general?
It's great. I've been waiting for something like this since the day I got a quad-core CPU.

QUOTE (enzo @ Jun 29 2009, 21:31) *
Are the compiler options safe or should I exclude any specific optimizations?
I don't know. :D

QUOTE (enzo @ Jun 29 2009, 21:31) *
Any other things you would like to mention?
What I think is important about MT transcoding: every system is different, and so is every codec implementation and each of their encoding/decoding settings.

Whether the CPU or the HDD is the bottleneck depends on all three. Ideally we want the CPU to be the bottleneck, but a slow HDD can actually lead to lower overall CPU usage when adding just one more thread. For example, depending on the WavPack encoder settings I can use either 3 or all 4 cores: if I don't use the extra switches I can only use 3 cores, or else my HDD turns into a sub-machine gun and the overall CPU usage drops below 75%. Well, that's how it was with my old drive.

The actual user has to know his/her system's bottlenecks; the coder of the transcoding app can't possibly know them. And to get the most out of it for every user, i.e. the fastest batch transcode, there should be some automation to determine how many threads to use with the current codecs and their current settings.

Therefore, an "easy" benchmark in an encoder suite would be nice. It could determine the overall bandwidth the current encoder settings need, and then let the user compare that to a simple I/O benchmark of his HDD, or to the I/O throughput of 1 to n decoding threads using selected input files.
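An automatic calibration like that could be as simple as timing a short sample batch at different worker counts and keeping the fastest. A minimal sketch (all names hypothetical; real encoder jobs would of course be external processes, not Python callables):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def best_worker_count(job, n_jobs, max_workers):
    """Time a batch of short sample jobs at 1..max_workers workers and
    return the worker count that finished the batch fastest."""
    timings = {}
    for workers in range(1, max_workers + 1):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(job, range(n_jobs)))  # run the whole sample batch
        timings[workers] = time.perf_counter() - start
    return min(timings, key=timings.get)
```

The sample batch should resemble the real workload (same codec, same settings, same disk) for the result to mean anything.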


This post has been edited by Fandango: Jun 30 2009, 00:25
Fandango
post Jun 30 2009, 00:31
Post #4





Group: Members
Posts: 1546
Joined: 13-August 03
Member No.: 8353



QUOTE (Mike Giacomelli @ Jun 29 2009, 22:50) *
How much do these settings change the output of the encoders?

Output should be identical.

QUOTE (Mike Giacomelli @ Jun 29 2009, 22:50) *
Do you get the same speed up when the source is WAV?

Sure, because this automatic parallelization is only applied to the encoders, it seems.

Uh, yeah. What about MT support in the decoders?

This post has been edited by Fandango: Jun 30 2009, 00:33
enzo
post Jun 30 2009, 08:36
Post #5


BonkEnc developer


Group: Developer
Posts: 70
Joined: 17-January 03
From: Hamburg
Member No.: 4611



Thank you for your answers so far!

I noticed I did not mention the encoder settings I used. All tests were made with BonkEnc's default settings so far:
LAME: Preset Standard
Vorbis: VBR, Q6.0
FLAC: -l 8 -r 3,3 -b 4608 -m -A "tukey(0.5)"

QUOTE (Fandango @ Jun 29 2009, 15:16) *
Also, there's a dll missing: libiomp5md.dll.
Just added it. Thanks for the hint!

QUOTE (Fandango @ Jun 29 2009, 15:16) *
General question: aren't there any license-free music tracks out there we could use and redistribute for testing purposes? Nine Inch Nails comes to my mind... but that's just one genre unless the many remixes are license free, too. Of course they should be checked for lossy compression artefacts first.
Yes, it's a good idea to find a common selection of music to test with first.

Nine Inch Nails' free album would be a good candidate, I guess, as it's available in CD-quality FLAC. Bach's Brandenburg Concertos are also available in FLAC format.

Archive.org has an Open Source Audio Collection with many entries available in FLAC as well.

QUOTE (Fandango @ Jun 29 2009, 15:16) *
What I think is important about MT transcoding: every system is different, and so is every codec implementation and each of their encoding/decoding settings.

Whether the CPU or the HDD is the bottleneck depends on all three. Ideally we want the CPU to be the bottleneck, but a slow HDD can actually lead to lower overall CPU usage when adding just one more thread. For example, depending on the WavPack encoder settings I can use either 3 or all 4 cores: if I don't use the extra switches I can only use 3 cores, or else my HDD turns into a sub-machine gun and the overall CPU usage drops below 75%. Well, that's how it was with my old drive.
I guess you are using multiple instances of wavpack in parallel in this case, right? The HDD has to reposition the r/w-head a lot when encoding to multiple outputs at once, because the write operations are not sequential. With a parallelized version of wavpack you could probably use all 4 cores without any problems as the HDD writes would still be sequential.

QUOTE (Fandango @ Jun 29 2009, 15:16) *
The actual user has to know his/her system's bottlenecks; the coder of the transcoding app can't possibly know them. And to get the most out of it for every user, i.e. the fastest batch transcode, there should be some automation to determine how many threads to use with the current codecs and their current settings.
I think that automatically determining the optimal number of threads in situations where the HDD is the limiting factor would be very difficult. The best guess is probably to use 1 thread per CPU core in most cases. However, I will add an option to let the user choose how many threads to use. Some may prefer to use only one or two cores even on a quad-core system, so the CPU fan does not speed up all the time.
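That heuristic (default to one worker per core, with a user override) might look like this; the names are hypothetical, and clamping the override to the core count is just one possible design choice:

```python
import os

def choose_thread_count(user_setting=None):
    """One worker per CPU core by default; honor an explicit user choice."""
    cores = os.cpu_count() or 1
    if user_setting is None:
        return cores
    return max(1, min(user_setting, cores))  # clamp to the range [1, cores]
```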

QUOTE (Mike Giacomelli @ Jun 29 2009, 12:50) *
How much do these settings change the output of the encoders?
In my tests, the output of parallel LAME and aoTuV was slightly smaller than the non-parallel output. FLAC files came out slightly bigger. I believe this is because of different floating point math commands/ordering used in the parallel builds. It is the same effect you get when comparing GCC to ICL builds or SSE optimized builds to plain IA32 builds. If that is the case, it's nothing to worry about. This will have to be further analyzed, though.

QUOTE (Mike Giacomelli @ Jun 29 2009, 12:50) *
Do you get the same speed up when the source is WAV?
Yes. In some cases the speedup is even greater, as CD ripping drops out as a limiting factor. For example, WAV to FLAC went from 1:18 to 0:41 for the 44 min album I tested with.

QUOTE (Fandango @ Jun 29 2009, 15:31) *
Uh, yeah. What about MT support in the decoders?
The decoders are parallelized as well, except for the FAAD2 AAC decoder, where the compiler was obviously unable to find parallelizable parts in the algorithm (CPU usage stayed at 25% during decoding) and the GCC build was actually faster than the ICL one.
enzo
post Jun 30 2009, 19:04
Post #6


BonkEnc developer


Group: Developer
Posts: 70
Joined: 17-January 03
From: Hamburg
Member No.: 4611



I updated the preview release again and removed support for the vbr-old LAME presets. So if you test LAME encoding speed with BonkEnc 1.0.12 vs. 1.0.13 you need to select the "Standard, Fast" preset in 1.0.12 first.

I also tested non-parallel ICL builds vs. parallel builds and they are producing exactly the same output. Furthermore, parallel FAAC and Bonk encoders produce exactly the same output as the GCC compiled builds included with BonkEnc 1.0.12.
saratoga
post Jun 30 2009, 20:24
Post #7





Group: Members
Posts: 4718
Joined: 2-September 02
Member No.: 3264



QUOTE (Fandango @ Jun 29 2009, 19:31) *
QUOTE (Mike Giacomelli @ Jun 29 2009, 22:50) *
How much do these settings change the output of the encoders?

Output should be identical.


Yes, I realize that we would like the same code to produce the same output regardless of how it's compiled. Unfortunately, we don't always get what we would like ;)

QUOTE
In my tests, the output of parallel LAME and aoTuV was slightly smaller than the non-parallel output. FLAC files came out slightly bigger. I believe this is because of different floating point math commands/ordering used in the parallel builds. It is the same effect you get when comparing GCC to ICL builds or SSE optimized builds to plain IA32 builds. If that is the case, it's nothing to worry about. This will have to be further analyzed, though.


Maybe do a quick decode to WAV and compute the RMS error of the difference between the two tracks. As long as it's small, I wouldn't worry too much about differences in bitrate. Of course, just because it's large doesn't necessarily mean there's a problem either...
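That check might be sketched like this, assuming both encodes are decoded to sample-aligned PCM of equal length (hypothetical helper):

```python
import math

def rms_difference(samples_a, samples_b):
    """RMS of the sample-wise difference between two decoded PCM tracks."""
    assert len(samples_a) == len(samples_b), "tracks must be sample-aligned"
    total = sum((a - b) ** 2 for a, b in zip(samples_a, samples_b))
    return math.sqrt(total / len(samples_a))
```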
Fandango
post Jul 1 2009, 20:05
Post #8





Group: Members
Posts: 1546
Joined: 13-August 03
Member No.: 8353



I did a test on the Brandenburg Concertos from the Czech radio website.

My system:
Intel Core 2 Quad 2.6 GHz (Q6700), 8 GB RAM, SATA2 Western Digital Caviar Black HDD, running Windows 7 64-bit.

So for starters, all 4 cores were fully used; BonkEnc used 98-99% CPU time. There was one thread in smooth.dll and 3 in the MT DLL. I guess that's how it's supposed to be.

But how do you get the encoding times? BonkEnc didn't write a log.
rpp3po
post Jul 1 2009, 20:24
Post #9





Group: Developer
Posts: 1126
Joined: 11-February 03
From: Germany
Member No.: 4961



I think you would get the best performance (without hurting quality) by concurrently encoding as many files as you have CPU cores. Nothing will beat that. So writing an intelligent scheduler that optimizes disk I/O (large, just-in-time sequential reads instead of parallel ones) would be the best you could do. People usually only care about an encoder's performance for bulk conversion, so not showing any benefit for single-file encodes wouldn't be a practical constraint most of the time.
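A minimal sketch of such a scheduler: one loop reads inputs sequentially while a pool of worker threads encodes them in parallel. All names are hypothetical, and a real implementation would bound memory use more carefully:

```python
import os
import queue
import threading

def batch_convert(files, read_file, encode_file, workers=None):
    """Read inputs sequentially (disk-friendly), encode them in parallel."""
    workers = workers or os.cpu_count() or 1
    jobs = queue.Queue(maxsize=workers)  # small buffer: just-in-time reads
    results, lock = [], threading.Lock()

    def encoder():
        while True:
            item = jobs.get()
            if item is None:          # sentinel: no more work
                return
            name, data = item
            encoded = encode_file(name, data)
            with lock:
                results.append(encoded)

    threads = [threading.Thread(target=encoder) for _ in range(workers)]
    for t in threads:
        t.start()
    for name in files:                # the only loop that reads from disk
        jobs.put((name, read_file(name)))
    for _ in threads:
        jobs.put(None)                # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Note that results arrive in completion order, not input order; a real converter would track ordering if it matters.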

This post has been edited by rpp3po: Jul 1 2009, 21:07
enzo
post Jul 1 2009, 21:53
Post #10


BonkEnc developer


Group: Developer
Posts: 70
Joined: 17-January 03
From: Hamburg
Member No.: 4611



QUOTE (Mike Giacomelli @ Jun 30 2009, 12:24) *
Maybe do a quick decode to WAV and compute the RMS error of the difference between the two tracks. As long as it's small, I wouldn't worry too much about differences in bitrate. Of course, just because it's large doesn't necessarily mean there's a problem either...
Ok, I will try that probably tomorrow.

QUOTE (Fandango @ Jul 1 2009, 12:05) *
But how do you get the encoding times? BonkEnc didn't write a log.
No, it does not support writing a log, sorry. I timed the conversions manually.

QUOTE (rpp3po @ Jul 1 2009, 12:24) *
I think you would get the best performance (without hurting quality) by concurrently encoding as many files as you have CPU cores. Nothing will beat that.
Yes, I agree and I am planning something like that for BonkEnc 1.1. However, that won't help much when you are ripping CDs.

I updated the preview release again with slightly faster encoder builds. Disabling multi-file IPO (the /Qipo switch) made the DLLs smaller and the encoders slightly faster.
enzo
post Jul 2 2009, 15:13
Post #11


BonkEnc developer


Group: Developer
Posts: 70
Joined: 17-January 03
From: Hamburg
Member No.: 4611



So, I did some RMSD calculations (the RW versions are the binaries from RareWares):
LAME-GCC vs. LAME-ICL: 1.41 (0.004%)
LAME-ICL vs. LAME-ICL-RW: 0.00
That looks just fine and leaves only aoTuV to worry about. The RMSD values for it are quite strange though:
aoTuV-GCC vs. aoTuV-ICL: 100.42 (0.31%)
aoTuV-ICL vs. aoTuV-ICL-RW: 100.51 (0.31%)
aoTuV-GCC vs. aoTuV-ICL-RW: 7.68 (0.02%)
I tried to find out what causes these differences and found that they are indeed triggered by the /Qparallel option. Compiled without /Qparallel it looks like this:
aoTuV-GCC vs. aoTuV-ICL: 4.24 (0.01%)
What's strange about this is that the RMSD stays high even if I manually disable parallelization of all affected loops with pragmas. It looks like /Qparallel changes something else as well that is not mentioned in the documentation.

Compared to the original, the RMSD is about the same as for the GCC and RW builds though:
Original vs. aoTuV-GCC: 133.41 (0.41%)
Original vs. aoTuV-ICL: 133.33 (0.41%)
Original vs. aoTuV-ICL-RW: 133.40 (0.41%)
So I guess we cannot know without a listening test whether quality is affected by this issue. Or what do you think?
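For what it's worth, the percentages quoted here are consistent with expressing the RMSD relative to 16-bit full scale (32768); assuming that convention, the figures above can be reproduced exactly:

```python
def rmsd_percent(rmsd, full_scale=32768):
    """Express an RMSD over 16-bit samples as a percentage of full scale."""
    return 100.0 * rmsd / full_scale

print(round(rmsd_percent(133.41), 2))  # Original vs aoTuV-GCC
print(round(rmsd_percent(100.42), 2))  # aoTuV-GCC vs aoTuV-ICL
print(round(rmsd_percent(1.41), 3))    # LAME-GCC vs LAME-ICL
```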

Edit:

Found the cause of this problem and a workaround for it. The problem is in the loop initializing e->mdct_win[] in the function _ve_envelope_init in envelope.c. Somehow the call to sin() seems to be optimized to use some faster version of it. As a workaround, I disabled optimization for the whole function using pragmas. As this is just an init function, the workaround does not affect overall speed.

This now gives the following RMSD:
aoTuV-GCC vs. aoTuV-ICL: 5.48 (0.02%)
That should be ok.

Edit 2:

Updated the preview with the new aoTuV build.

This post has been edited by enzo: Jul 2 2009, 17:29
enzo
post Jul 7 2009, 21:32
Post #12


BonkEnc developer


Group: Developer
Posts: 70
Joined: 17-January 03
From: Hamburg
Member No.: 4611



Here are my speed comparisons for the Brandenburg Concertos and the NIN album. I also tested the most recent builds from RareWares as they have always been the reference of encoding speed for me in the past.

The numbers are for my Phenom II X4 system.

Brandenburg Concertos - 89:47 min
CODE
Input   Output   GCC     ICL-RW   ICL
-------------------------------------
FLAC    LAME      6:02    6:48    4:10
FLAC    Vorbis   10:31    5:48    4:29
FLAC    FAAC      6:41    7:37    3:37
FLAC    FLAC      3:16    2:45    1:37
Nine Inch Nails - The Slip - 43:47 min
CODE
Input   Output   GCC    ICL-RW   ICL
------------------------------------
FLAC    LAME     3:19    3:12    1:57
FLAC    Vorbis   5:24    4:08    2:12
FLAC    FAAC     3:31    3:11    1:40
FLAC    FLAC     1:54    1:15    0:47
I will do some more speed comparisons on my laptop and then release the final BonkEnc 1.0.13 with these encoders in a few days.
gottkaiser
post Jul 8 2009, 08:31
Post #13





Group: Members
Posts: 171
Joined: 7-January 05
From: Germany
Member No.: 18891



Hi Enzo,

Will you release an updated development snapshot?
enzo
post Jul 8 2009, 12:58
Post #14


BonkEnc developer


Group: Developer
Posts: 70
Joined: 17-January 03
From: Hamburg
Member No.: 4611



Yes, a new snapshot should be available next week. I will probably finish feature coding tomorrow, but need to do some additional testing and refactoring before the release.

Besides the optimized encoders, the snapshot will add WMA support and allow editing tags of existing files.
gottkaiser
post Jul 31 2009, 17:48
Post #15





Group: Members
Posts: 171
Joined: 7-January 05
From: Germany
Member No.: 18891



QUOTE (enzo @ Jul 8 2009, 12:58) *
Yes, a new snapshot should be available next week.

Anything new about the new snapshot?
Don't want to push you, I'm just excited about the new multicore support.
enzo
post Aug 5 2009, 10:45
Post #16


BonkEnc developer


Group: Developer
Posts: 70
Joined: 17-January 03
From: Hamburg
Member No.: 4611



QUOTE (gottkaiser @ Jul 31 2009, 09:48) *
Anything new about the new snapshot?
Don't want to push you, I'm just excited about the new multicore support.

Sorry, I had to delay the snapshot because of problems with the new tag editor. I do now plan to release it tomorrow.
