Public Listening Test [2010]

Many people will agree that it's time for a public listening test.

Short Agenda:
1.   Organization
2.   Samples
3.   Codecs
4.   Bitrates

Detailed Agenda:
1.   Organization.
People with experience of personal or public tests are welcome here.
Personally, I can offer help with conducting the test.
Unfortunately, if I understand correctly, Sebastian has said that he doesn't want to conduct public tests.

2. Samples.
Different styles of music, different levels of difficulty, samples that highlight known issues, etc.? To be discussed here or in a separate topic.

3. Codecs
3.a)  Multiformat test.
3.b)  AAC test. 
I think it's more appropriate to conduct an AAC test because there are at least 3 AAC encoders to test: Nero, Apple and Coding Technologies/Dolby. All of these codecs were updated during this year. More information in the List of AAC encoders.

4. Bitrates.
96-100 kbit/s?
It's possible to perform the test at 128 kbit/s with very hard samples; Aemese-like samples would be very easy to distinguish from the original.


Sorry for my English.

All kinds of thoughts and suggestions are welcome here.

Public Listening Test [2010]

Reply #1
Codecs: I would love to see an AAC test including Nero 1.3.3.0 vs 1.5.1.0.

Samples: Has 24-bit / >44 kHz material ever been tested? With more and more web shops selling music at these resolutions and sample rates, this might get interesting.


Public Listening Test [2010]

Reply #2
Definitely AAC. Old Nero vs. new Nero alone will bring in a lot of interest. Add in Apple's latest, with its true VBR option. And it'd be interesting to see CT/Dolby in the mix, since it's been left out of so many AAC discussions. And I agree with you, it should be in the ~96k range.

Public Listening Test [2010]

Reply #3
I could imagine an AAC test at 80 or 96 kbps. At 80 kbps you could feature both LC and HE encoders and see which one performs best. The winner could be used in a multiformat listening test at the same bitrate.

Regarding the samples, muaddib once had an idea to create a samples DB that should be divided into problem samples and regular samples. When preparing a test, one could pick X samples from that DB based on lottery numbers so that people don't complain that sample Y was selected with the purpose of letting encoder A appear better than encoder B. Additionally, samples collected especially for the respective test could be used like it was done in the past.
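For illustration, here is a minimal sketch of how such a lottery draw could work (the database contents, sample counts and seed below are purely hypothetical):

```python
import random

# Hypothetical sample DB, split the way muaddib suggested:
# known problem samples vs. regular material.
sample_db = {
    "problem": ["eig", "castanets", "fatboy", "harp40_1"],
    "regular": ["rock_01", "jazz_02", "classical_03", "vocal_04", "electro_05"],
}

def draw_samples(db, n_problem, n_regular, seed):
    """Draw a test set using a published seed (the 'lottery number'),
    so anyone can re-run the draw and verify it afterwards."""
    rng = random.Random(seed)
    picked = rng.sample(db["problem"], n_problem) + rng.sample(db["regular"], n_regular)
    rng.shuffle(picked)  # don't reveal which pool each sample came from
    return picked

# e.g. announce the seed in the test thread before drawing
print(draw_samples(sample_db, n_problem=2, n_regular=3, seed=2010))
```

Publishing the seed in advance is what would make the draw verifiable, which is the whole point of the lottery idea.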

Good luck with the test, Igor. Are you planning to use existing software (ABC/HR for Java) for the test or something new? ABC/HR's development is dead, unfortunately, and there were some problems in my last test requiring the installation of JRE 1.5 (which some people with JRE 1.6 found annoying).

Public Listening Test [2010]

Reply #4
An AAC test would be interesting indeed. But at 96 kbps or so, you would have to test both AAC LC and HE-AAC, since they offer about the same average quality. With about 4 encoders under test (Apple, 2x nero, Dolby, ...), this would give 8 codecs under test. That's too many in my opinion (risk of overload and fatigue). How about 112 kbps or so? There you can be relatively sure that LC is better than HE on average, i.e. you would need to test only LC.

Chris
If I don't reply to your reply, it means I agree with you.

Public Listening Test [2010]

Reply #5
Good luck with the test, Igor.

That sounded sarcastic. 


Are you planning to use existing software (ABC/HR for Java) for the test or something new? ABC/HR's development is dead, unfortunately, and there were some problems in my last test requiring the installation of JRE 1.5 (which some people with JRE 1.6 found annoying).

Yes, I remember this issue. Speaking for myself, I could get around it without a problem, although with a workaround. We'll have to see what happens with other people.

Definitely AAC. Old Nero vs. new Nero alone will bring in a lot of interest. Add in Apple's latest, with its true VBR option. And it'd be interesting to see CT/Dolby in the mix, since it's been left out of so many AAC discussions. And I agree with you, it should be in the ~96k range.

True VBR mode isn't available in the Windows version, so many people couldn't use it.

An AAC test would be interesting indeed. But at 96 kbps or so, you would have to test both AAC LC and HE-AAC, since they offer about the same average quality. With about 4 encoders under test (Apple, 2x nero, Dolby, ...), this would give 8 codecs under test. That's too many in my opinion (risk of overload and fatigue).
   
iTunes has HE-AAC only up to 80 kbps. Nero's default settings switch to HE-AAC up to ~85 kbps.
From my personal test, Apple LC-AAC is already better than HE-AAC at 80 kbps.
http://www.hydrogenaudio.org/forums/index....showtopic=74781
But yes, it's a personal test, not a public one.
96 kbps will be enough to test only LC-AAC codecs.
Anyway, all bitrates should be shifted to ~100 kbps, since iTunes' constrained VBR mode at 96 kbps produces real bitrates of ~100 kbps. MediaCoder can be used to shift bitrates for the CT encoder.
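To keep that comparison fair, the measured average bitrate of each encoder over the whole sample set could simply be reported. A rough sketch of that measurement (file names and durations below are made up; real durations would come from the decoded files):

```python
import os

def average_bitrate_kbps(files_with_durations):
    """Average bitrate of a set of encoded files in kbit/s.
    `files_with_durations` is a list of (path, duration_in_seconds) pairs;
    container overhead is included, as in the ~100 kbps figure above."""
    total_bits = sum(os.path.getsize(path) * 8 for path, _ in files_with_durations)
    total_seconds = sum(duration for _, duration in files_with_durations)
    return total_bits / total_seconds / 1000.0

# Made-up example: the sample set encoded with one setting
itunes_96_cvbr = [("sample01.m4a", 30.0), ("sample02.m4a", 27.5)]
print(f"iTunes 96 CVBR: {average_bitrate_kbps(itunes_96_cvbr):.1f} kbps")
```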


How about 112 kbps or so? There you can be relatively sure that LC is better than HE on average, i.e. you would need to test only LC.
Chris
   
iTunes doesn't have a 112 kbps setting, only 96 and 128.

As far as I can see, people are interested in an AAC test with the following codecs:
1. Nero 1.3.3
2. Nero 1.5.1
3. CT
4. Apple
5. Divx (???) or other?

I think 4-5 codecs are already enough. Not more. What do you think?

I found that the DivX AAC encoder has the potential to get into the test.
http://www.hydrogenaudio.org/forums/index....st&p=675456

Public Listening Test [2010]

Reply #6
Regarding the samples, muaddib once had an idea to create a samples DB that should be divided into problem samples and regular samples. When preparing a test, one could pick X samples from that DB based on lottery numbers so that people don't complain that sample Y was selected with the purpose of letting encoder A appear better than encoder B. Additionally, samples collected especially for the respective test could be used like it was done in the past.

I see. That's a very important part of getting fair results. It should be done this way.

Public Listening Test [2010]

Reply #7
A ~100 kbps listening test should be fine. Quality at 128 kbps was usually too high to get interesting results in a public test.
I'd rather have a multiformat test than a pure AAC one. Two tests would be ideal: an AAC one and then a multiformat one which would include the best AAC implementation. But that would be much more of a hassle. So go for either AAC or multiformat.

About Nero AAC: I don't see the point of testing the two latest releases. Otherwise, why wouldn't we test two different implementations of the other competitors? If the Nero developers have released this new encoder, it's because it was tested, ready to use, and therefore solid and trustworthy.
Don't put too many competitors in the arena: the more encoders you have, the harder it is to rate them accurately. In the end, many contenders will only bring statistical noise and the test will end with no clear winner. And don't forget the anchors: they're really essential to avoid or limit discrepancies.

Ideally, and for a public listening test, I would go for 2 competitors and 2 anchors. But it won't be very attractive to many people. So 3 competitors and 2 anchors is probably the most doable configuration.

Good luck.

Public Listening Test [2010]

Reply #8
I see. Didn't know that most encoders don't offer HE-AAC for bitrates higher than ~85 kbps.

As far as I can see, people are interested in an AAC test with the following codecs:

...

I think 4-5 codecs are already enough. Not more. What do you think?


Agreed. But for completeness' sake: Fraunhofer is currently finalizing quality tunings on their encoder, which have been going on for about two years. Release is scheduled for the end of January. If there is any interest, I can ask whether it's possible to provide an evaluation encoder for this test.

Chris
If I don't reply to your reply, it means I agree with you.


Public Listening Test [2010]

Reply #10
4. Bitrates.
96-100 kbit/s?
It's possible to perform the test at 128 kbit/s with very hard samples; Aemese-like samples would be very easy to distinguish from the original.


If this is already clear enough, then forgive me, but I think the test needs to have a very well-defined and limited goal with respect to bitrates. When you start adding parameters and/or choosing optional settings, it can become quite unclear what is equivalent or fair from one encoder to another, especially if you are dealing with command-line vs. GUI encoders.

Public Listening Test [2010]

Reply #11
True VBR mode isn't available in the Windows version, so many people couldn't use it.


It's available in Quicktime Pro.

About Nero AAC: I don't see the point of testing the two latest releases.


Now that you mention it, it doesn't really make sense.  Drop Nero 1.3.3 from the test.

So 3 competitors and 2 anchors is probably the most doable configuration.


Agreed.  Man of logic. 

Public Listening Test [2010]

Reply #12
About Nero AAC: I don't see the point of testing the two latest releases. Otherwise, why wouldn't we test two different implementations of the other competitors? If the Nero developers have released this new encoder, it's because it was tested, ready to use, and therefore solid and trustworthy.

Yes and no.
There is no reason to compare two Nero versions against other encoders.
But many people here use the Nero encoder intensively and want to know if there is any improvement (more specifically, in quality).
You also partially compared two versions of Nero in your last test, and I compared two versions of Nero in my previous test too.
But there are many encoders to test, so it would be a reasonable option to include only the latest Nero encoder.
I will make a poll to see what people prefer.


Don't put too many competitors in the arena: the more encoders you have, the harder it is to rate them accurately. In the end, many contenders will only bring statistical noise and the test will end with no clear winner. And don't forget the anchors: they're really essential to avoid or limit discrepancies.

Ideally, and for a public listening test, I would go for 2 competitors and 2 anchors. But it won't be very attractive to many people. So 3 competitors and 2 anchors is probably the most doable configuration.

I can't disagree here. Sebastian's public tests also indicate that 3 (maybe 4; it should be discussed) competitors is a good balance, even more so considering that today's AAC encoders provide quite good quality at 100 kbps, which makes them more difficult to ABX.



Possible high anchor:
1. LAME -V4 or -V3. I think -V5 is too risky to be a high anchor.
2. Nero or Apple at 128 kbps.

Possible low anchor:
In my opinion the low anchor shouldn't be that bad.
LAME -V7 (~100 kbps) or ABR 100.

Agreed. But for completeness' sake: Fraunhofer is currently finalizing quality tunings on their encoder, which have been going on for about two years. Release is scheduled for the end of January. If there is any interest, I can ask whether it's possible to provide an evaluation encoder for this test.

Chris

That would be really good.
As already mentioned, we should limit ourselves to 3 competitors.

I propose to do two separate AAC tests during 2010.

The three most widespread AAC encoders would be tested in the 1st test:
1. Nero 1.5.1
2. Apple
3. CT
4?. Maybe Nero 1.3.3

And then the winner would be tested in the 2nd test:
1. Winner of the 1st test
2. DivX (?)
3. Fraunhofer (?)
or any other encoder with reasonable quality.

True VBR mode isn't available in the Windows version, so many people couldn't use it.


It's available in Quicktime Pro.

It's not practical. A person has to open each source file in the QuickTime player, then remux it MOV->M4A, and would be tired after encoding only 1 album (>10 tracks).
I think we should stick with the more practical iTunes unless a better solution turns up.

4. Bitrates.
96-100 kbit/s?
It's possible to perform the test at 128 kbit/s with very hard samples; Aemese-like samples would be very easy to distinguish from the original.


If this is already clear enough, then forgive me, but I think the test needs to have a very well-defined and limited goal with respect to bitrates. When you start adding parameters and/or choosing optional settings, it can become quite unclear what is equivalent or fair from one encoder to another, especially if you are dealing with command-line vs. GUI encoders.

Sorry, I don't understand what you're getting at. Please be more specific.



Public Listening Test [2010]

Reply #14
I would like to see QuickTime's new true VBR encoder in the test, even if it's a Mac OS X exclusive. Also, having FAAC in the test as a low anchor would be interesting.
"I never thought I'd see this much candy in one mission!"

Public Listening Test [2010]

Reply #15
Please tell me if I'm on the wrong track.

How about a test of the transparency of codecs?
I'd love to see another test and find out what the current VBR sweet spot is for LAME.
I suspect -V3 could replace -V2.

I miss the days when there was consensus on -V2 (I remember when --r3mix first came out too!)

Thanks

Public Listening Test [2010]

Reply #16
Don't put too many competitors in the arena: the more encoders you have, the harder it is to rate them accurately. In the end, many contenders will only bring statistical noise and the test will end with no clear winner. And don't forget the anchors: they're really essential to avoid or limit discrepancies.

Ideally, and for a public listening test, I would go for 2 competitors and 2 anchors. But it won't be very attractive to many people. So 3 competitors and 2 anchors is probably the most doable configuration.

I can't disagree here. Sebastian's public tests also indicate that 3 (maybe 4; it should be discussed) competitors is a good balance, even more so considering that today's AAC encoders provide quite good quality at 100 kbps, which makes them more difficult to ABX.

Possible high anchor:
1. LAME -V4 or -V3. I think -V5 is too risky to be a high anchor.
2. Nero or Apple at 128 kbps.
Your and Guruboolez' own private tests have shown that AAC at 96k is likely to tie or even slightly outperform MP3 at 128k, which in itself has proven to be pretty much transparent in any recent 128k test, including last year's 128k MP3 test conducted by Sebastian. The latter didn't include a high anchor either, since it's supposed to function as a clearly distinguishable reference point. If you have AAC at 96k tested against LAME -V5 or even -V3, you risk having all of them rated 4.5 or so, making it hard, even impossible, to draw any interesting conclusions. In other words, I believe codecs have evolved to such a point that, competing against 96k AAC, picking a high anchor is moot, imho.

The only valid reason I see for still including one is to prove the almost self-fulfilling prophecy that MP3 is not the codec of choice in the 96k range, at least quality-wise. Whether that makes it worth sacrificing participants' precious time and effort is debatable.

In fact, last year's 128k MP3 test didn't even give LAME the highest grade of the bunch, so having Helix as high anchor makes just as much sense to me.

Edit: then again, if we'd raise the bar even higher than already near-to-transparent LAME -V4 (or so), by simply throwing in the lossless original samples as high anchor, perhaps lowpassed at some 17 kHz, maybe that could keep the participants from handing out 4.5 averages to the 96k lossy samples.  This could set us up for a bold teaser: does 96k sound as good as the lossless originals?
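If that lowpassed-original idea were pursued, the anchor files could be generated with a simple zero-phase filter. A sketch, assuming the scipy and soundfile packages are available (the 17 kHz cutoff and filter order are just illustrative values):

```python
import numpy as np
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

def make_lowpassed_anchor(in_wav, out_wav, cutoff_hz=17000, order=8):
    """Create a '17 kHz lowpassed original' style high anchor from a lossless WAV."""
    audio, fs = sf.read(in_wav)                 # shape: (frames, channels) or (frames,)
    sos = butter(order, cutoff_hz, btype="low", output="sos", fs=fs)
    filtered = sosfiltfilt(sos, audio, axis=0)  # zero-phase: no time offset vs. the original
    sf.write(out_wav, np.clip(filtered, -1.0, 1.0), fs)

# make_lowpassed_anchor("sample01_original.wav", "sample01_anchor.wav")
```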

Possible low anchor:
In my opinion the low anchor shouldn't be that bad.
LAME -V7 (~100 kbps) or ABR 100.
Agreed, and another thing your own tests have indicated: any MP3 codec at this bitrate, even LAME, will most probably wind up statistically worse by quite a margin, which makes it a very valid low anchor, even more so because it'll be competing at the same bitrate as its contenders.

I believe that, in a larger-scale, public test, we shouldn't be too convinced that at 96k, AAC is the poised winner over any other encoder. Mind you, this bitrate range has gone essentially untested on a wide, public scale in recent years, as opposed to 128k, and 64k and below, both of which are potentially far off in terms of perceived audio quality. Cf. http://wiki.hydrogenaudio.org/index.php?ti...Listening_Tests and especially Sebastian's and Roberto's accounts of the public tests they conducted. Which is why I think we should put Aotuv Vorbis and WMA to the test too. Chances are fair that it may end up as a statistical tie once again.

So I guess, in the end, my vote goes out to a multi-format test featuring 1 carefully selected AAC codec, Vorbis (Aotuv?) and WMA (Professional?), with LAME at the same bitrate as low anchor (edit) and lossless original as high anchor.

Public Listening Test [2010]

Reply #17
So I guess, in the end, my vote goes out to a multi-format test featuring 1 carefully selected AAC codec, Vorbis (Aotuv?) and WMA (Professional?), with LAME at the same bitrate as low anchor (edit) and lossless original as high anchor.

The decision on such a selection should be based on the results of a public test.
The problem is that it has been a very long time since the last LC-AAC test. That's not the case for Vorbis, where Aotuv is clearly the optimal encoder. An AAC test must be done before the multiformat one, as many people here will agree.


In other words, I believe codecs have evolved to such a point that, competing against 96k AAC, picking a high anchor is moot, imho.

Yep.
Apple 96 VBR was used as the high anchor in a previous public test and it actually did very well: http://www.listening-tests.info/mf-48-1/results.htm
In my opinion (correct me if I'm wrong), unless there is a good reason to keep a high anchor, we shouldn't include one.

As I can see from the poll, including two versions of Nero makes very little sense.

Updated proposal for the test list:
1. Nero 1.5.1
2. Apple iTunes or QT (true VBR). To be discussed.
3. CT or DivX or FhG. One option is to do an internal pre-test where the participants would be some well-known members of HA. The DivX encoder has a good enough VBR mode, while CT only has CBR.

Anchors:
High anchor: do we really need one? I would propose Nero or Apple at 128 kbps. LAME -V4/-V3 risks getting low scores on some hard samples.
Low anchor: LAME -V7.

I would like to see QuickTime's new true VBR encoder in the test, even if it's a Mac OS X exclusive. Also, having FAAC in the test as a low anchor would be interesting.

The low anchor should be clearly inferior to all competitors. If FAAC's quality is at that level, it can be included as the low anchor.

I would follow Guru's suggestion, as we also have to take into account that AAC is actually quite good at ~100 kbps and will take a lot of work to ABX. Only 3 competitors, not more.

Public Listening Test [2010]

Reply #18
There is no reason to compare two Nero versions against other encoders.
But many people here use the Nero encoder intensively and want to know if there is any improvement (more specifically, in quality).
You also partially compared two versions of Nero in your last test, and I compared two versions of Nero in my previous test too.
But there are many encoders to test, so it would be a reasonable option to include only the latest Nero encoder.
I will make a poll to see what people prefer.

In the test you've linked, two different implementations of Nero AAC were tested in a qualification pool, not in the final one. And if I did this, it was for a good reason: at the time, a very quick experiment made me very suspicious about the output quality of the latest release of that encoder. And the test's results showed my feelings were right (at least for me).

Currently nobody complains about the latest Nero AAC release and nobody has said the previous one was obviously better, so I really don't see the point of a competition between last year's encoder and the 2009 one. Some people are probably interested in this kind of comparison, but they can easily do it themselves; and if they find something weird then we might reconsider our initial choice. Myself, I'm very interested in comparing 1.0.7.0 (which really impressed me 2 years ago) and the latest 1.5.1.0, but I don't see the point of making such a comparison in a public listening test alongside iTunes AAC, FhG AAC and CT AAC. Such a comparison would immediately imply that we don't trust Nero AAC's development cycle that much; that we can safely use the latest iTunes AAC without risk but not the latest Nero AAC.


Possible low anchor:
In my opinion the low anchor shouldn't be that bad.
LAME -V7 (~100 kbps) or ABR 100.


It looks too strong for a low anchor. I'm also against a dramatically poor low anchor, but using one of the most advanced MP3 encoders at the same bitrate at which we'll test the main contenders is a bit of a risk. But the comparison would be interesting, I confess... For the sake of the test I'd rather lower the bitrate or use a very old encoder (ISO AAC maybe?) at 100 kbps. Or maybe halve the bitrate with HE-AAC(v2)?

Public Listening Test [2010]

Reply #19
High anchor: do we really need one?
Imho, only to keep people from rating most (all?) competitors an unoriginal 4.5 score. As said, that may almost only be achieved by contrasting them with the original, lossless samples. But on the other hand, of course, that would make the test even more difficult to take.

Fwiw, as I promised last year, I am willing to host the test samples once more.  Plenty of bandwidth available.

Public Listening Test [2010]

Reply #20
Why not add the FFmpeg AAC encoder to this test?

Public Listening Test [2010]

Reply #21
Imho, only to keep people from rating most (all?) competitors an unoriginal 4.5 score. As said, that may almost only be achieved by contrasting them with the original, lossless samples. But on the other hand, of course, that would make the test even more difficult to take.

Which is precisely why we need the hidden reference as a high anchor! If a listener assigns a grade of, say, 3.0 to the hidden lossless sample, we know that the listener is unable to identify the stimulus which must sound identical to the known original (and instead hears differences which are not there), so that listener must be post-screened, i.e. his/her results removed from statistical analysis.

At Fraunhofer, we use MUSHRA tests and post-screen listeners who assigned a grade lower than 90 (out of 100) to any hidden reference. If we are going to do an ABX test here, I propose the following:

- Low anchor: MP3 at 96 kbps CBR as I expect it to give slightly worse results than VBR.
- High anchor: Lossless original. Edit: no further high anchors to minimize listening time.
- Post-screening rules: Remove all listeners from analysis who
  a) graded the high anchor lower than 4.5,
  b) graded the low anchor higher than the high anchor.
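For clarity, a small sketch of how those two rules could be applied to the collected grades (the data layout and codec labels are just an illustration, not the actual ABC/HR result format):

```python
def post_screen(results, high_anchor="reference", low_anchor="mp3_cbr96"):
    """Keep only listeners who pass both proposed rules:
    a) high anchor graded at least 4.5,
    b) low anchor not graded above the high anchor."""
    kept = {}
    for listener, grades in results.items():
        if grades[high_anchor] >= 4.5 and grades[low_anchor] <= grades[high_anchor]:
            kept[listener] = grades
    return kept

# Made-up grades on the usual 1.0-5.0 scale
results = {
    "listener_01": {"reference": 5.0, "mp3_cbr96": 3.2, "codec_a": 4.1, "codec_b": 4.3},
    "listener_02": {"reference": 3.0, "mp3_cbr96": 4.0, "codec_a": 2.5, "codec_b": 2.8},
}
print(post_screen(results))  # listener_02 is dropped by rule (a)
```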

What do you think?

Chris
If I don't reply to your reply, it means I agree with you.

Public Listening Test [2010]

Reply #22
I like it (with some reservations about the low anchor).


Public Listening Test [2010]

Reply #24
It looks too strong for a low anchor. I'm also against a dramatically poor low anchor, but using one of the most advanced MP3 encoders at the same bitrate at which we'll test the main contenders is a bit of a risk. But the comparison would be interesting, I confess... For the sake of the test I'd rather lower the bitrate or use a very old encoder (ISO AAC maybe?) at 100 kbps. Or maybe halve the bitrate with HE-AAC(v2)?

Last time I tried, HE-AAC at 48k was more or less comparable to MP3 at 96k.
iTunes or Nero LC-AAC at 48-56k CBR could be an optimal low anchor: nearly 2x lower bitrate.

- High anchor: Lossless original. Edit: no further high anchors to minimize listening time.

Speaking of the lossless original as high anchor, did you mean:
1. There won't be a high anchor at all,
or
2. There will be a supposedly lossy file that is actually the lossless reference?

- Post-screening rules: Remove all listeners from analysis who
  a) graded the high anchor lower than 4.5,
  b) graded the low anchor higher than the high anchor.

What do you think?
Chris

How about stricter rules?

Remove all listeners from analysis who
  a) graded the high anchor lower than 4.8-4.9, or even 5 in case the high anchor is lossless;
  b) graded the low anchor higher than any competitor. The low anchor, by definition, should be clearly inferior to any of the competitors.


Should we test the true or the constrained VBR mode of the Apple encoder?

Why not add the FFmpeg AAC encoder to this test?

I haven't had time to test it yet. What stage of development is it at? All the encoders to be tested are stable releases. A partial exception could be the DivX AAC encoder (beta stage), but I'd attribute that more to the MainConcept->DivX rebranding, as it's based on stable MainConcept code.