Personal Listening Test of Opus, Celt, AAC at 75-100kbps, ABC/HR blind test, 1 Listener
jmvalin
post Nov 22 2012, 02:45
Post #26


Xiph.org Speex developer


Group: Developer
Posts: 473
Joined: 21-August 02
Member No.: 3134



QUOTE (Kamedo2 @ Nov 20 2012, 23:37) *
I meant 'hard samples' in the sense that their real bitrate exceeds the value of the --bitrate option. In the real world, if we were to randomly choose 100 samples, it's likely that roughly 50 would fall below the --bitrate value and 50 above. Yes, 1:1.


Well, I tend to define "hard samples" as those for which the real bitrate is much higher than the target. But even by your definition, there should be fewer than 50% hard samples. You can think of Opus VBR (simplified explanation) as reducing the rate of all samples by about 10%, and then looking for things that require an increase. You can see that in your results: with --bitrate 66, the lowest real rate is 60 while the highest is 101, so the distribution is highly asymmetric. What really matters is not exactly how many are above the target, but how many are in the "long tail" of outliers.
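As a side note, the asymmetry can be made concrete with a few invented numbers. The per-sample rates below are hypothetical, chosen only to mimic the 60-101 kbps spread mentioned above, not real measurements:

```python
# Hypothetical per-sample average bitrates (kbps) from a VBR encode at
# --bitrate 66; invented to mimic the asymmetric spread described above.
rates = [60, 61, 62, 63, 64, 65, 65, 66, 66, 67, 68, 72, 80, 90, 101]
target = 66

above = [r for r in rates if r > target]
print(len(above), "of", len(rates), "exceed the target")  # 6 of 15
print("tail above:", max(rates) - target)                 # 35 kbps
print("tail below:", target - min(rates))                 # 6 kbps
```

Fewer than half the samples sit above the target, but the upper tail is far longer than the lower one, which is what the long-tail argument is about.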

QUOTE (Kamedo2 @ Nov 20 2012, 23:37) *
As for "better compression vs better quality", those who want to calibrate the x-axis to better represent the real world should also calibrate the y-axis. Calibrating the y-axis involves listening tests with many easy samples, and it's understandable why people don't want to do that; but on an average-bitrate vs average-score graph, calibrating only one axis would break the internal consistency of the graph, especially when it includes CBR codecs like CELT, and would make the comparison less fair.


Sorry, I don't understand what you mean by y-axis calibration.
Kamedo2
post Nov 22 2012, 07:33
Post #27





Group: Members
Posts: 168
Joined: 16-November 12
From: Kyoto, Japan
Member No.: 104567



QUOTE (jmvalin @ Nov 22 2012, 10:45) *
QUOTE (Kamedo2 @ Nov 20 2012, 23:37) *
As for "better compression vs better quality", those who want to calibrate the x-axis to better represent the real world should also calibrate the y-axis. Calibrating the y-axis involves listening tests with many easy samples, and it's understandable why people don't want to do that; but on an average-bitrate vs average-score graph, calibrating only one axis would break the internal consistency of the graph, especially when it includes CBR codecs like CELT, and would make the comparison less fair.

Sorry, I don't understand what you mean by y-axis calibration.

Sorry for my bad explanation.


There are some people who complain that my set of samples focuses on hard samples and doesn't include enough easy samples. Among them, some suggest that I should encode a large set of normal music and use the average bitrate of that large collection as the x-axis of the average-bitrate vs average-quality graph. If you are not among those people, you can stop reading now. The reason those people advocate using large collections is that such collections contain more easy samples and are more natural. They want to see more easy samples thrown in and see what happens to the bitrate. What they fail to notice is that if I were to throw in more easy samples, not only the bitrate but also the quality would unavoidably change. With VBR this supposedly doesn't happen, but it often does. Because they don't notice that, they adjust the x-axis but never touch the y-axis, moving the plot only horizontally. But as more easy samples are thrown in, the bitrate on the x-axis decreases while the quality on the y-axis naturally goes up. So the horizontal-only calibration some people advocate is unnatural.


Y-axis calibration probably doesn't exist, short of a lengthy process involving listening tests on a large set of easy samples.
jmvalin
post Nov 22 2012, 18:10
Post #28


Xiph.org Speex developer


Group: Developer
Posts: 473
Joined: 21-August 02
Member No.: 3134



QUOTE (Kamedo2 @ Nov 22 2012, 01:33) *
There are some people who complain that my set of samples focuses on hard samples and doesn't include enough easy samples. Among them, some suggest that I should encode a large set of normal music and use the average bitrate of that large collection as the x-axis of the average-bitrate vs average-quality graph. If you are not among those people, you can stop reading now.


Well, I'm among those who think that adjusting the x-axis that way isn't ideal, but better than just using the average rate over the test samples as the x-axis. It avoids having to include a large number of easy samples in the test.

QUOTE (Kamedo2 @ Nov 22 2012, 01:33) *
The reason those people advocate using large collections is that such collections contain more easy samples and are more natural. They want to see more easy samples thrown in and see what happens to the bitrate. What they fail to notice is that if I were to throw in more easy samples, not only the bitrate but also the quality would unavoidably change. With VBR this supposedly doesn't happen, but it often does. Because they don't notice that, they adjust the x-axis but never touch the y-axis, moving the plot only horizontally. But as more easy samples are thrown in, the bitrate on the x-axis decreases while the quality on the y-axis naturally goes up. So the horizontal-only calibration some people advocate is unnatural.


Y-axis calibration probably doesn't exist, short of a lengthy process involving listening tests on a large set of easy samples.


I also don't know how you would do "y-axis calibration". The best I can think of here is to divide the test samples into two subsets: a hard subset like you currently have, and an "average" subset that represents the normal samples you're likely to encounter in practice. You would still use an even larger database for the x-axis, but now you give two quality values.
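A minimal sketch of that two-number reporting, with invented scores (the subset names and values are assumptions for illustration only):

```python
# Report one quality figure per subset instead of a single average.
# The scores below are invented for illustration.
scores = {
    "hard": [3.8, 3.5, 3.9, 3.2],   # known problem samples
    "average": [4.6, 4.8, 4.7],     # samples typical of normal music
}
for name, vals in scores.items():
    mean = sum(vals) / len(vals)
    print(f"{name}: {mean:.2f}")
```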
Dynamic
post Nov 22 2012, 19:03
Post #29





Group: Members
Posts: 793
Joined: 17-September 06
Member No.: 35307



I think the objectives of tests (experiments) matter, and the questions they aim to answer should determine the primary graphs produced. Scatter plots of actual bitrates over the test samples are often not central to the question being asked, though they may be interesting secondary information.

I suspect for a lot of consumers, bitrate really amounts to a proxy for "How much music can I store on my device?" or "How much space can I save in my music to take photos on my smartphone?".

Instantaneous bitrate or even bitrate over a whole song isn't that important to them.

They may also ask: "What's the most music I can store on my device at a reasonably good quality?"

The meaning of reasonably good quality will vary.

For some, perceptible but not annoying differences may be tolerable much of the time, with just once or twice every few hours something mildly annoying and unmusical being noticed.
For others, entirely transparent most of the time, but once or twice every few hours, some difference that's perceptible but not annoying being noticed is their quality lower limit.

These are just two examples from the range of possible requirements.

For example, suppose we are testing 160 variants (e.g. 20 samples over 2 bitrates of 4 encoders, or 10 samples over 4 bitrates of 4 encoders):

Question 1:
At the same bitrate over a general representative collection (thus being able to store the same amount of music on my device), which encoder offers the best quality?

(This is the question your tests are roughly set up to answer)

Question 2:
At the same bitrate over a general representative collection (thus being able to store the same amount of music on my device), what typical quality can we expect from each encoder (ignoring problem samples unless they account for 5% or more of general music).

(This is the question 'some people' seem to want answered)

Each question might warrant a different test:

Question 1 might warrant a large number of problem samples of various classes to make rare annoyances easy to detect and penalize the encoder that deals with problem samples worse. This might be best for people who wish to avoid the occasional annoying and unmusical artifacts.

Question 2 might warrant a collection of normal samples not known to be problematic, to get a representative idea of typical quality, giving people an idea which bitrates might suit them. This might be best for people who are more forgiving of occasional artifacts, but who would start to penalize any encoder that exposes artifacts frequently rather than rarely. It might also be sensible to include the same 'typical' samples over a range of a few bitrates and limit the number of samples in the test corpus (e.g. 10 samples tested over 4 bitrates per encoder) to match this aim.

There are cases of academic interest, with limited practical use to the generic listener, where you'd really want to see a scatter plot: to know how variable VBR is in each encoder, or to compare quality against actual bitrate and get an idea of the bitrate efficiency of the coding tools available in a particular format. Even then, the length of the sample may have a big influence on the reported bitrate: a 2-second passage requiring lots of high-bitrate short blocks within a 30-second clip of otherwise ordinary sound will show a lower average bitrate than the same 2 seconds within a 5-second clip, even though the instantaneous bitrate of the difficult part is the same. Then again, some samples, like fatboy (Kalifornia by Fatboy Slim), are difficult almost throughout the song.

In a sense it ought to be made clear to the average listener that the average bitrate line, not the scattered per-sample bitrates, is what matters for questions like "How much music can I fit on my device?" or "How much space will be left on my phone's microSD card for taking photos and videos?". On the quality axis, the average quality has some importance, but the spread of quality values can matter too (less spread usually being good), as can the lowest reported quality (if, for example, you're rather unforgiving of a really nasty artifact; in my case it's 'birdies' and 'warbling' in bad MP3 encoders that drive me crazy).

I think Question 1 is where you're leaning when saying that changing the type of sample used changes the quality (easy sample - higher quality) even for VBR.
I think Question 2 is the sort of listening test 'some people' have asked you for.

Where many of us at Hydrogen Audio differ from you is in our belief that average bitrate over a collection of CDs is the important bitrate calibration even for Question 1.

When making an experiment to test Question 1, we use hard samples to differentiate encoders more easily and find out which is best. At the same time, we realise that the reported quality is representative only of rare problem cases, and is therefore penalised and not representative of normal music, so we wouldn't claim to have useful information about the general quality that can be expected.

Question 2 is actually quite hard to test, partly because the Quality is often very close to transparent (5.0) especially at around 96-100 kbps on recent encoders with easy samples.

I guess a useful graph for a single codec and mode (e.g. Opus in VBR mode) could be one that showed Quality (y) versus Average Bitrate (x) but plotted as two different lines at each bitrate.
The upper line (blue) would be general music quality (based on 5 to 20 samples of normal music of various genres). This might start rather low at 32 kbps, increase at 48 kbps, get pretty high at 64 kbps and reach 5.0 at 96 kbps.
The lower line (red) would be 'problem sample quality', where a collection of typical codec-killers is used (fatboy, tomsdiner, eig, etc.). This might start terribly low at 32 kbps, still be pretty bad at 48, get a bit better at 64 kbps and reach something like 4.0 at 96 kbps, and if extended to, say, 128 kbps, it might get quite close to 5.0.

The lines could also be fit lines, with quality scatter points above and below them to give an indication of the spread of quality for general samples (blue) and for problem samples (red).

The results of such a test might be relatively informative especially in terms of total space occupied by your music or total music duration per amount of storage space (e.g. expressed in hours per Gigabyte or Gigabytes per 10 hours) rather than focusing on bitrate.
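To put the hours-per-gigabyte framing in concrete terms, the unit conversion is straightforward (taking 1 GB = 10^9 bytes; the bitrates chosen are just examples):

```python
def hours_per_gigabyte(kbps):
    """How many hours of audio fit in 1 GB at a given average bitrate."""
    bits_per_gb = 8e9                      # 1 GB = 1e9 bytes = 8e9 bits
    seconds = bits_per_gb / (kbps * 1000.0)
    return seconds / 3600.0

print(round(hours_per_gigabyte(96), 2))    # ~23.15 hours per GB at 96 kbps
print(round(hours_per_gigabyte(66), 2))    # ~33.67 hours per GB at 66 kbps
```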

Kamedo2
post Nov 22 2012, 22:42
Post #30





Group: Members
Posts: 168
Joined: 16-November 12
From: Kyoto, Japan
Member No.: 104567



QUOTE (Dynamic @ Nov 23 2012, 03:03) *
Question 1:
At the same bitrate over a general representative collection (thus being able to store the same amount of music on my device), which encoder offers the best quality?

(This is the question your tests are roughly set up to answer)

This is relatively easy to answer. If occasional problem samples matter and one wants to avoid ugly artifacts, this test tells how common these exceptionally bad moments are.

QUOTE (Dynamic @ Nov 23 2012, 03:03) *
Question 2:
At the same bitrate over a general representative collection (thus being able to store the same amount of music on my device), what typical quality can we expect from each encoder (ignoring problem samples unless they account for 5% or more of general music).

(This is the question 'some people' seem to want answered)

The problem with using a general representative collection is that I don't know the quality of the collection. I may know in the future, after 3, 4, or 5 months of listening tests, but not now; don't expect me to know.
But something similar can be done. I removed the problem samples, namely finalfantasy (harpsichord), FloorEssence (techno), VelvetRealm (techno, sharp attacks), and Tom's Diner (female a cappella),
and replotted the bitrate vs quality graph on the remaining 16 samples.



Hope it helps.
Kamedo2
post Nov 23 2012, 16:35
Post #31





Group: Members
Posts: 168
Joined: 16-November 12
From: Kyoto, Japan
Member No.: 104567



QUOTE
rjamorim: There's some inverse proportionality there
rjamorim: At low bitrates nobody is interested, but the results are easy to obtain
rjamorim: At high bitrates everyone is interested, but you practically can't obtain usable results


The same can be said of hard samples and easy samples.

I uploaded all the samples in the Upload forum.
http://www.hydrogenaudio.org/forums/index....showtopic=98003
Kamedo2
post Nov 23 2012, 21:51
Post #32





Group: Members
Posts: 168
Joined: 16-November 12
From: Kyoto, Japan
Member No.: 104567



I measured the average bitrate over a wide range of 15 normal songs (63 min).

opusenc --bitrate 66 input.wav output.opus
65.9kbps

celtenc input.48k.raw --bitrate 75 --comp 10 output.oga
73.4kbps

qaac --cvbr 72 -o output.m4a input.wav
75.2kbps

qaac --tvbr 27 -o output.m4a input.wav
66.0kbps

opusenc --bitrate 90 input.wav output.opus
88.9kbps

celtenc input.48k.raw --bitrate 100 --comp 10 output.oga
98.9kbps

qaac --cvbr 96 -o output.m4a input.wav
100.0kbps

qaac --tvbr 45 -o output.m4a input.wav
91.9kbps

The 15 songs I used are, of course, different from the 20 samples I used in this listening test.
I understand the importance of calibration, but I'm still reluctant to use these values as the x-axis of the bitrate vs quality graph.

Imagine you make an x=height vs y=weight scatter graph over many people, and you plot Alice's height as x and Bob's weight as y; that's a chimera.
When I plot Charlie, I use Charlie's height and Charlie's weight. When I plot Dave, I use Dave's height and Dave's weight.
jmvalin
post Nov 24 2012, 19:49
Post #33


Xiph.org Speex developer


Group: Developer
Posts: 473
Joined: 21-August 02
Member No.: 3134



QUOTE (Kamedo2 @ Nov 23 2012, 15:51) *
The 15 songs I used are, of course, different from the 20 samples I used in this listening test.
I understand the importance of calibration, but I'm still reluctant to use these values as the x-axis of the bitrate vs quality graph.

Imagine you make an x=height vs y=weight scatter graph over many people, and you plot Alice's height as x and Bob's weight as y; that's a chimera.
When I plot Charlie, I use Charlie's height and Charlie's weight. When I plot Dave, I use Dave's height and Dave's weight.


The method certainly has drawbacks, but it's not as silly as you might think. Here's another way to reason about it. I have a music player with a fixed capacity (e.g. 8 GB) and I need to fit my entire music collection on it. I compute the average rate I can afford and use that to encode all songs. Now imagine I wanted to figure out which codec to use. I can apply that process with many different codecs, and then compare the codecs to see which one provides the best music quality. Ideally, I'd listen to each song, but that would take way too long. The alternative is to pick only a relatively small sample. I *could* pick my samples at random, but in practice I probably want to bias my selection towards harder samples (because it's not OK to have 90% good files and 10% awful ones), while still having "normal" samples too. So I do the listening test on those samples and pick the best codec. When I do that, I have used the quality of a few samples as the y-axis, with the rate over a large sample as the x-axis. And it makes sense to do so.
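The budget step in that scenario is simple arithmetic; a sketch, where the 8 GB capacity comes from the post and the collection length is my own assumption:

```python
# Affordable average bitrate for a fixed-capacity player.
capacity_bits = 8e9 * 8           # 8 GB (taking 1 GB = 1e9 bytes), in bits
collection_hours = 150            # hypothetical collection length
collection_secs = collection_hours * 3600

affordable_kbps = capacity_bits / collection_secs / 1000
print(round(affordable_kbps, 1))  # ~118.5 kbps
```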
Kamedo2
post Nov 25 2012, 15:34
Post #34





Group: Members
Posts: 168
Joined: 16-November 12
From: Kyoto, Japan
Member No.: 104567



QUOTE (jmvalin @ Nov 25 2012, 03:49) *
QUOTE (Kamedo2 @ Nov 23 2012, 15:51) *
The 15 songs I used are, of course, different from the 20 samples I used in this listening test.
I understand the importance of calibration, but I'm still reluctant to use these values as the x-axis of the bitrate vs quality graph.

Imagine you make an x=height vs y=weight scatter graph over many people, and you plot Alice's height as x and Bob's weight as y; that's a chimera.
When I plot Charlie, I use Charlie's height and Charlie's weight. When I plot Dave, I use Dave's height and Dave's weight.


The method certainly has drawbacks, but it's not as silly as you might think. Here's another way to reason about it. I have a music player with a fixed capacity (e.g. 8 GB) and I need to fit my entire music collection on it. I compute the average rate I can afford and use that to encode all songs. Now imagine I wanted to figure out which codec to use. I can apply that process with many different codecs, and then compare the codecs to see which one provides the best music quality. Ideally, I'd listen to each song, but that would take way too long. The alternative is to pick only a relatively small sample. I *could* pick my samples at random, but in practice I probably want to bias my selection towards harder samples (because it's not OK to have 90% good files and 10% awful ones), while still having "normal" samples too. So I do the listening test on those samples and pick the best codec. When I do that, I have used the quality of a few samples as the y-axis, with the rate over a large sample as the x-axis. And it makes sense to do so.


One of the big objectives of this test is to compare Opus with CELT.
Does VBR-enabled Opus offer even better efficiency than CELT? Or did the developers do something silly, leaving efficiency not improved at all?

Let's assume the latter. Suppose CELT and Opus are ideal CBR and VBR encoders with exactly the same performance.
The ideal CBR encoder does not change its bitrate at all, so harder songs occupy a lower place on a bitrate vs quality plot.
The ideal VBR encoder does not change its quality at all, so harder songs occupy the right side of the plot.
Suppose there is a normal music collection of 21 songs, each 4 minutes long, we have 63 MB of storage, and we test which encoder offers better quality.
However, ABC/HR-ing the entire collection is painstaking, so we omit the 8 easiest songs and test only the 13 harder ones.
We then know the quality of only those 13 songs, but we know all 21 bitrates, so we can use either the 13-song or the 21-song average bitrate.
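The storage budget in this thought experiment works out as follows (taking 1 MB = 10^6 bytes):

```python
# 21 songs x 4 minutes into 63 MB of storage.
total_bits = 63e6 * 8                  # 63 MB in bits
total_secs = 21 * 4 * 60               # 84 minutes in seconds
print(total_bits / total_secs / 1000)  # -> 100.0 kbps average budget
```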

Which bitrate should we use?


Remember, these two encoders have exactly the same performance.
Compare the CBR consensus plot (red) with my plot (blue): they imply roughly the same performance.
Then compare the red plot with the dark blue plot: the same implication is gone, and the graph suggests the VBR encoder is superior.
(Some may say it is superior because of its smaller quality spread; but this is an ideal encoder, and actual behavior may vary. Look at the error bars of the actual Opus!)

I think the emphasis on bitrate calibration at HydrogenAudio is well deserved, and I respect it. From now on, if I reproduce this test, I'll calibrate the bitrate on a larger collection so that the bitrates are roughly the same. That's what people really want to know.

I'll use the bitrate of the test-sample set only when the problem explained above can actually occur: an overemphasis on hard samples combined with CBR codecs.
Kamedo2
post Nov 25 2012, 21:35
Post #35





Group: Members
Posts: 168
Joined: 16-November 12
From: Kyoto, Japan
Member No.: 104567



My post #34 might be too difficult to follow. I wish I had better words to explain it.

Let me review what this test really was. I used slightly-easy to hard samples and took the average of the 20 bitrates and 20 qualities.
There was no big measured difference in either bitrate or quality between Opus and CELT (green circle and orange circle).

However, this test overemphasizes hard samples, and what people really want to see are the true Opus score and the true CELT score (blue and red circles).
The true bitrate is relatively easy to obtain, but the true quality requires additional effort. So some may want to adjust only the bitrate, the cheaper option,
to get closer to real usage. The likely consequence is that we end up somewhere close to the true Opus score (blue circle) and the measured CELT score (orange circle).
What we want is the blue and red circles, and what we get is the blue and orange circles. At least one score is what we want; it's an improvement from zero matches to one match.
And we get the conclusion: Opus is better.

But hey, is that a fair comparison? It ignores how nicely CELT encodes easy samples, and between the blue and red true scores I don't think there would be a big difference.
IgorC
post Nov 26 2012, 03:08
Post #36





Group: Members
Posts: 1506
Joined: 3-January 05
From: Argentina, Bs As
Member No.: 18803



Interesting.
Opus's scores show less deviation from the average score for particular items than CELT's. It means Opus has more constant quality.
Something hard to see when looking only at average scores.

At the moment I have some thoughts about it. Imagine we have two encoders. Both produce the same average score, but one of them produces a larger variation in results; for example, encoder A scores 4.5 +/- 0.5 and encoder B 4.5 +/- 0.25.

For an experienced listener both have the same quality.
But the situation is different for another listener who can't spot the differences that others would score above 4.7. All samples of encoder B with scores above 4.7 will be ranked as transparent (5.0) in that case, so for the less experienced listener (or a listener with inferior hardware), encoder B will have a higher average score (and quality as well).

This post has been edited by IgorC: Nov 26 2012, 03:09
Dynamic
post Nov 26 2012, 21:57
Post #37





Group: Members
Posts: 793
Joined: 17-September 06
Member No.: 35307



Once again, Kamedo2, I applaud your testing skills and your willingness to adapt your method and present your data to answer different questions.

As I understand it, posts #34 and #35 are both hypothetical data, not real data, used to illustrate a point about different ways of interpreting the same data. Am I correct?
Kamedo2
post Nov 26 2012, 23:56
Post #38





Group: Members
Posts: 168
Joined: 16-November 12
From: Kyoto, Japan
Member No.: 104567



QUOTE (IgorC @ Nov 26 2012, 11:08) *
Opus's scores show less deviation from the average score for particular items than CELT's. It means Opus has more constant quality.


Sorry, I couldn't understand what you meant by 'particular items'. Yes, it's plausible that Opus has slightly less quality deviation than CELT: 0.163 < 0.173 at 75k, 0.126 < 0.135 at 100k.
(Although it's not statistically significant, and I think the difference is almost negligible.)
The second and third paragraphs are about nonlinearity, I guess. If an encoder encodes the first half of a track at 4.0 (perceptible but not annoying) and the last half at 2.0 (annoying), I'll rate it about 2.5, not 3.0.
If another encoder does the first half at 3.6 and the last half at 3.4, my rating is 3.5. I have some thoughts about nonlinear averages too.
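One way to formalize that kind of nonlinear average is a power mean with a negative exponent, which pulls the result towards the worse scores. The exponent p = -2 below is my own assumption for illustration, not Kamedo2's actual rule, but it happens to reproduce both of the ratings described above:

```python
def power_mean(scores, p=-2):
    """Power (generalized) mean; p < 1 weights low scores more heavily."""
    n = len(scores)
    return (sum(s ** p for s in scores) / n) ** (1.0 / p)

print(round(power_mean([4.0, 2.0]), 2))  # ~2.53 (vs arithmetic mean 3.0)
print(round(power_mean([3.6, 3.4]), 2))  # 3.5 (near-equal halves barely change)
```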
Kamedo2
post Nov 27 2012, 00:14
Post #39





Group: Members
Posts: 168
Joined: 16-November 12
From: Kyoto, Japan
Member No.: 104567



QUOTE (Dynamic @ Nov 27 2012, 05:57) *
As I understand it, posts #34 and #35 are both hypothetical data, not real data, used to illustrate a point about different ways of interpreting the same data. Am I correct?


Thank you. Of course, since I don't know the scores of samples that were never tested, posts #34-35 are hypothetical data based on how an ideal encoder is supposed to work.
And although I strive to give the same quality the same rating in every case, my ratings are not as accurate as posts #34-35 assume.
jmvalin
post Jan 3 2013, 01:09
Post #40


Xiph.org Speex developer


Group: Developer
Posts: 473
Joined: 21-August 02
Member No.: 3134



Kamedo2, can you give 1.1-alpha a try? It includes post-exp_analysis changes that should address some issues exposed in your test. For example, samples 2 and 5 should have a lower rate, while samples 7 and 19 should have a higher rate.
DonP
post Jan 3 2013, 02:57
Post #41





Group: Members (Donating)
Posts: 1469
Joined: 11-February 03
From: Vermont
Member No.: 4955



QUOTE (IgorC @ Nov 25 2012, 21:08) *
At the moment I have some thoughts about it. Imagine we have two encoders. Both produce the same average score, but one of them produces a larger variation in results; for example, encoder A scores 4.5 +/- 0.5 and encoder B 4.5 +/- 0.25.
....
For an experienced listener both have the same quality.
But the situation is different for another listener who can't spot the differences that others would score above 4.7. All samples of encoder B with scores above 4.7 will be ranked as transparent (5.0) in that case, so for the less experienced listener (or a listener with inferior hardware), encoder B will have a higher average score (and quality as well).


IMO if the encoder's VBR is quality-based, the one with more consistent quality deserves more credit anyway.
Kamedo2
post Jan 5 2013, 11:50
Post #42





Group: Members
Posts: 168
Joined: 16-November 12
From: Kyoto, Japan
Member No.: 104567



QUOTE (jmvalin @ Jan 3 2013, 09:09) *
Kamedo2, can you give 1.1-alpha a try? It includes post-exp_analysis changes that should address some issues exposed in your test. For example, samples 2 and 5 should have a lower rate, while samples 7 and 19 should have a higher rate.

I'm going to run an MP3 224 kbps ABC/HR test first (a very long test), so I'll have no time to test Opus. Sorry.
