
Attempt to improve lossless comparison tests

If you don't want to waste 15 minutes reading this post, here is the summary:
  • Lossless comparisons only consider a small number of songs due to time considerations
  • The compression ratio of any song can be estimated by a short sample of that song
  • Compressing short samples from a large number of songs might give a better estimate of a codec's performance than compressing a small number of complete songs
  • I've written a couple of programs to test this, along with the effectiveness of compressing tracks vs. albums
  • It could be better to compress a selection of individual songs rather than albums
  • It could be better still to compress munged files – many short chunks of different songs glued together – provided the effect on the compression ratio of ‘gluing’ the chunks together is known
  • Many more tests are probably needed to verify these things
Please note that all of this is limited by my ability and knowledge; don't treat it (or me) as though I am claiming things as fact!  Also, I apologise that it has the stuffy feel of a science report: I'm writing this as practice for (and a way of putting off) my next report for school.



The Background
I've been working on a little pet project on and off for a couple of months, born out of curiosity about how to read & write wave files, and the dire need to improve my statistics knowledge.  Since first seeing Hans's lossless comparison I have wondered how accurate these lossless tests can be: how well do they reflect my taste in music, and music in general?

Twice I've wanted to archive some music away never to be needed again, and so grabbed a copy of LA to save a few hundred megabytes.  Both times (on some home-made music, and Pink Floyd's Tree Full of Secrets) I found that APE's extra high mode beat LA's maximum mode.  This probably isn't rare, but it raised the issue of how accurate Hans's test could be, and more importantly how confident people could be in the comparison.


The Problem
The problem as I see it is the narrow sample of songs chosen in this, and every other, comparison out there.  For example, if the album 'I Care Because You Do' were chosen in place of 'Richard D. James', wouldn't the overall ratios of the codecs be different in absolute value, and more importantly different relative to each other (e.g. APE possibly at a lower ratio than LA)?

The obvious solution of encoding more albums to gain a better representation has to be ruled out: I was staggered at the patience of Hans, Speek & Guru in compressing and decompressing 5gb+ of data with 30 different options.  So instead I began to investigate the idea of fitting more songs into that 5gb to give a wider view of each codec's compression ability.


The Proposition
The majority of songs can be pictured as a few sections, each built from a different short segment repeated (with slight variation) throughout.  As an example, you could call the chorus of a song a 'section', and if you picked one bar out of it, you would have a short segment that represents the chorus.  If you assume the compression ratio within each broad section of a song is pretty similar throughout, you can pick out a 5 second segment to represent the whole section.

I hope I don't lose people in the clumsy terminology there; the principle is that compressing a couple of 5 second chunks from any given song will hopefully give a reasonable estimate of the compression ratio of that song.  In this case the average song becomes 15 seconds long instead of 3 minutes – so you could get a reasonable approximation for 12 songs in the same space as the exact ratio for 1 song.

That is where the central question of this test and discussion lies: is the approximate ratio of a vast number of songs better than the exact ratio of a small selection of songs?  By better, I mean closer to the exact ratio of the vast number of songs.


The Method
I started writing a program to review a large collection of songs (my entire music collection) and 'munge' them.  By this I mean it picks out 5 second samples, with both the choice of song and the position within that song made at random.
 
After selecting a random chunk, the program 'glues' it on to the end of the last one to create a single file with all the chunks in a long chain.  In order to minimise the impact of this on the codec's predictor, it attempts to match the sample value and gradient across boundaries.  I chose not to crossfade or add new samples to smooth the transition, as this would introduce material that wasn't in the original songs.  Instead it scans around the vicinity of the chunk's starting point and chooses the starting position that best matches the ending point of the previous chunk.  When listening to a munged file there are no clicks from the cut'n'paste that is going on; all the chunks are joined quite smoothly.
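To make the boundary-matching idea concrete, here is a minimal sketch of how such a join could be chosen. It assumes mono audio already loaded as NumPy arrays and songs longer than a chunk plus the search window; the function names, search window size and cost function are illustrative guesses, not the actual program.

Code
# Minimal sketch of the boundary-matching join described above.
# Assumes mono audio as NumPy int16/float arrays; names, the search
# window and the cost function are illustrative, not the author's code.
import numpy as np

def pick_chunk(song, prev_tail, chunk_len, search_window):
    """Pick `chunk_len` samples whose start best matches the value and
    gradient at the end of the previously glued chunk."""
    if prev_tail is None:
        start = np.random.randint(0, len(song) - chunk_len)
        return song[start:start + chunk_len]

    target_value = float(prev_tail[-1])
    target_grad = float(prev_tail[-1]) - float(prev_tail[-2])

    # Pick a nominal start at random, then scan nearby offsets for the
    # smoothest join (smallest mismatch in value and gradient).
    nominal = np.random.randint(search_window, len(song) - chunk_len - search_window)
    best_start, best_cost = nominal, None
    for start in range(nominal - search_window, nominal + search_window):
        value = float(song[start])
        grad = float(song[start + 1]) - float(song[start])
        cost = abs(value - target_value) + abs(grad - target_grad)
        if best_cost is None or cost < best_cost:
            best_start, best_cost = start, cost
    return song[best_start:best_start + chunk_len]

def munge(songs, n_chunks, chunk_len=5 * 44100, search_window=2000):
    """Glue randomly chosen chunks from randomly chosen songs into one signal."""
    pieces, prev_tail = [], None
    for _ in range(n_chunks):
        song = songs[np.random.randint(len(songs))]
        chunk = pick_chunk(song, prev_tail, chunk_len, search_window)
        pieces.append(chunk)
        prev_tail = chunk[-2:]
    return np.concatenate(pieces)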

This presented an argument against such a method that I cannot really counter or prevent.  The culling of >90% of the material in individual songs can be justified as you can include 10x more songs to give a wider selection.  The real problem is that individual chunks aren't going to be surrounded by their original material, so the predictor of an encoder is going to be 'in the wrong state' at the start of every chunk (not in the same state as it would be in the actual song).  Whether the error introduced by this would negate any improvements was one subject I tried to investigate with some tests.


The Tests
The 'traditional' practice of picking a selection of songs needs to be tested against the proposed idea of munging the whole collection into a summary.  The best method I could think of was to look at how the compression ratio changes with different selections of input material.  I saw this as a way of answering the question 'what if Hans chose a different Aphex Twin album?' posed earlier.  The standard deviation of a group of 'equivalent choices' or 'equivalent files' was taken as a measure of this spread.

For the munged files this was simple but time consuming.  Running the program a number of times with the same parameters will choose different chunks from within the vast collection, as the positions to take chunks from are chosen at random.  This gives a set of equivalent files, which differ only due to the random selections.
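As a sketch of that measurement, using the munge() sketch above and with compress_ratio standing in for a call to whichever encoder is under test (both placeholders, not the actual tooling):

Code
# Estimate the spread of 'equivalent' munged files: repeat the random
# munge and measure each resulting file's compression ratio.
import statistics

def spread_of_munges(songs, n_runs, n_chunks, compress_ratio):
    ratios = [compress_ratio(munge(songs, n_chunks)) for _ in range(n_runs)]
    return statistics.mean(ratios), statistics.stdev(ratios)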

To compare this to the range of compression ratios found from picking individual songs, I wrote a second program to run through all the possible combinations of tracks and albums that could be made from a collection (e.g. if I have 5 songs, ABCDE, it will give the compression ratio of A, B, C, D, E, A+B, A+C, A+D .. A+B+C .. and so on); a sketch of the idea is below.  I also chose to test the constraint of one album from each genre, as done in Hans's test, since this seems the most logical way of choosing a selection of albums rather than ending up with 3 metal albums and no classical music.
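A rough illustration of that second program, assuming the original and compressed size of each track are already known; the sizes and names here are made up purely for the example.

Code
# Compression ratio of every k-track selection, given per-track sizes.
from itertools import combinations

def selection_ratio(tracks):
    """tracks: iterable of (original_bytes, compressed_bytes) pairs."""
    orig = sum(t[0] for t in tracks)
    comp = sum(t[1] for t in tracks)
    return comp / orig

def all_selection_ratios(tracks, k):
    """Ratio of every possible k-track combination drawn from `tracks`."""
    return [selection_ratio(combo) for combo in combinations(tracks, k)]

# Five made-up tracks A..E (sizes in MB): A, B, ..., then A+B, A+C, etc.
tracks = [(40, 22), (35, 21), (50, 33), (45, 20), (30, 19)]
print(all_selection_ratios(tracks, 1))
print(min(all_selection_ratios(tracks, 2)), max(all_selection_ratios(tracks, 2)))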

The effect of having artificial joins in the munged file was also tested by creating files with different chunk lengths, ranging from 0.1s up to 5s or so.


The Results
The tests I've made so far have been preliminary, and thus are pretty haphazard.  The first set were made with ~1500 songs (90 albums) totalling about 60gb, encoded with APE 3.97 extra high.  This was culled down to ~500 songs to save time and disk space, and the files were converted to APE 3.99 high.  The broad genres I grouped them into were electronic, hip hop, metal, drum'n'bass & other (classical + film scores).  My collection was far from ideal for measuring the compression ratio of 'general music', but it provided a mix of tunes compressing anywhere from 40% to 80%, which helps highlight differences in the results (if all the tunes were 55-60% any discrepancies between methods would be tiny).

The results in general look at the accuracy and reproducibility of a selection.  Accuracy is how close the chosen selection comes to the true compression ratio, that of the entire collection as a whole; reproducibility is how much the result varies depending on which random choices are made.

For example, picking one song and compressing that could be very accurate if it happens to be 60%, or wildly inaccurate if it's a quiet classical piece compressing to 40% (it is 20% too low).  If every possible choice is made (each song is looked at individually in turn) the median compression ratio could be reasonably close to the true value (if the songs are evenly spread about their mean, 100 above and 100 below) but the deviation will be huge, as you could be anywhere up to 20% out based on your choice.

I considered the 'efficiency' of a selection to be how good the result was compared to the amount of data you had to compress to get it.  If you can get very close to the correct value, with a small spread, by compressing only 1% of the entire music collection, your selection is very efficient; compressing 5x as much data and getting the same result would be less efficient.


Test of standard methods
Possibly the most immediately useful results concern the methods currently employed in lossless tests.  Random selections of tracks, of albums, and of '1 album per genre' were compared.  In contradiction to the method of current tests, grouping tracks together into their respective albums is less efficient than picking individual tracks at random.  It was also noted that shaping the choice of albums by picking one of each genre gave a very different value for the compression ratio, although this was expected for reasons described later.

This would imply that any semblance of grouping tracks together will harm the results of a test.  I guess the rationale behind this is that by breaking the music collection into the smallest possible unit (individual tracks rather than albums), a larger number of choices is gained (50 tracks instead of 5 albums).  By randomly selecting the material to compress, a fair representation of the collection is gained, and by giving a larger scope for random selection the test is improved.

Some graphs could save me writing more, so these plot the compression ratio and standard deviation of the different selections (tracks, albums, genres) against the amount of data that was compressed for each selection.  e.g. Hans's test is based on 7 albums, so that would be roughly 4.5gb, Speek's is based on 10 albums (6.5gb), and a test based on 20 individual tracks would need something like 1gb.





I think that is all for now.  Data, programs and source code can be made available to anyone interested; I should probably clean the profanity out of them first though, so ask if you want.  Thanks for reading!

Attempt to improve lossless comparison tests

Reply #1
so i'm confused as to the focus of all this...

music/sound/noise is dynamic and the resulting data will obviously change as the input changes... so the goal of benchmark-style testing is to include an exhaustive amount of data so as to make the resulting deviance minimal or unimportant.

within the context of benchmark testing, codecs must have a resulting value - I think you are confusing percentage (compression ratio) with a static value rather than a value generated using dynamic input.

also, why test a codec's ability to compress random/frankenstein noise samples when that codec will not be used for that purpose?  current unofficial benchmarking goes above and beyond what anyone needs to know as far as performance - but feature comparison could be expanded upon to make a better educated decision.

btw, "ruling out" obvious solutions is a bad practice, imho


just my thoughts, anyways...

[edit:typo]

Attempt to improve lossless comparison tests

Reply #2
Slightly off at a tangent: I'd like to see a comparison done on a PPC machine to see which codecs take advantage of Altivec.

Nice graphs by the way.

edit: Talking of which, it would be nice to see a version of Hans's graph that didn't treat compression + speed performance as a single point, but as a cluster of points for each tested audio file, to give an idea of which codecs were consistent and which had a wide range of performance characteristics.

Attempt to improve lossless comparison tests

Reply #3
I think the "munging" idea is good for getting more representative numbers (as long as the slices aren't too short; at least 10 secs is probably OK as I don't think any of the codecs have long-term prediction).  the munger is definitely a useful tool.  but even random selection from a corpus will still reflect the overall corpus bias.  the only way around this is to have as large and varied a corpus as possible.

but then you will only get an "average" number.  the measures that will be more relevant to people will still be genre-based and format-based (by format-based I mean CD-DA (16/44.1), 24/96 audio, multichannel audio, etc).  it is among formats and genres that the codecs tend to differ more.

Josh

Attempt to improve lossless comparison tests

Reply #4
Very interesting idea! Could potentially be very useful for getting better test results faster.
I think it would also be interesting to see the difference between this method and selecting albums/tracks per genre.

As for faults in your reasoning/testing, the only thing I could find that definitely needs a bit of looking into is the uncertainty in the correction factor. Ideally it should be minimised as much as possible, perhaps by choosing longer sections, by modifying the way you join sections, or even by changing the way a file is munged. One thing you may especially want to look into is the way you join sections: apparently you do try to make the transition not too abrupt, but you may want to make it even less abrupt, as the sudden change in content will be relatively hard to encode for most lossless compressors. And you shouldn't worry too much about introducing something that wasn't there before, as you're doing that already anyway (the transition); you just need to find a way to minimize the effect it has. Theoretically it might even be possible to simply generate a short piece of sound that has properties similar to the original, but isn't even composed of actual sections from the original (based on frequency data for example).

I also have a question about your first graphs (showing the results for selecting tracks and albums). If I understand correctly, you compressed quite a few tracks and then calculated what different combinations of those tracks would yield as a result. If that is the case I'd imagine that quite a lot of combinations would be possible (hundreds), probably all with slightly varying data sizes. So I suspect you grouped them together somehow, but my question is how did you group them, and what was your reason for doing it that way? (Or am I overlooking something?) I'm not very good at statistics, but I imagine it could influence the results if, for example, groups with large data sizes had more members.

I'd definitely like to help you look into this further btw. For one thing, I have working code for using FLAC (and commandline encoders), as well as a tool to encode a file without actually writing the encoded file (it just keeps track of its size).

Attempt to improve lossless comparison tests

Reply #5
Quote
music/sound/noise is dynamic and the resulting data will obviously change as the input changes... so the goal of benchmark-style testing is to include an exhaustive amount of data so as to make the resulting deviance minimal or unimportant.

within the context of benchmark testing, codecs must have a resulting value - I think you are confusing percentage (compression ratio) with a static value rather than a value generated using dynamic input.

My argument is that by compressing a hundred 10 second samples you have a selection of music with lower deviance than by compressing five full songs.  If we somehow had access to every hip hop song made and found that most would compress to between 50% and 56%, a comparison test that gave the single value of 53% would be good in my opinion.  If a different test which used a small sample gave 55%, this isn't such a good representation of that genre. 

Quote
also, why test a codec's ability to compress random/frankenstein noise samples when that codec will not be used for that purpose?

Frankenstein samples is a good way to describe them, but the aim is to make ones that aren't just random noise; they should represent the actual music people compress very well.


Quote
the munger is definitely a useful tool. but even random selection from a corpus will still reflect the overall corpus bias. the only way around this is to have as large and varied a corpus as possible.

but then you will only get an "average" number.  the measures that will be more relevant to people will still be genre-based and format-based ...  it is among formats and genres that the codecs tend to differ more.

I would like to see a number of corpora to reflect the most common genres and produce the kind of test bawjaws mentions.  Perhaps being able to collaborate with a number of people to create these, based off several thousand songs, would help create 'well rounded' files to base the test on.


Quote
As for faults in your reasoning/testing, the only thing I could find that definitely needs a bit of looking into is the uncertainty in the correction factor. Ideally it should be minimised as much as possible, perhaps by choosing longer sections, by modifying the way you join sections, or even changing the way a file is munged.
...
Theoretically it might even be possible to simply generate a short piece of sound that just has properties similar to the original, but isn't even composed of actual sections from the original (based on frequency data for example).

Crossfading samples was something I wanted to try next (a rough sketch of what that could look like is below); it might help to ease the codec into the next sample and reduce the error.  One thing I have tested was to select all the random positions first, then sort them into ascending order in time.  If the collection is ordered by genre and then by artist, this places all the chunks from the same artist together, and if there are two chunks from a given song, they will be next to each other.  This only has a noticeable effect with short chunk lengths, when there are 2 or more chunks per song on average.
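For reference, a linear crossfade join might look something like this sketch; the overlap length and array types are assumptions, and as discussed earlier the blended region is new material that exists in neither source song.

Code
# Sketch of a linear crossfade between the tail of one chunk and the
# head of the next.  Chunks are assumed to be mono NumPy arrays; the
# default overlap is roughly 10 ms at 44.1 kHz and is arbitrary.
import numpy as np

def crossfade_join(a, b, overlap=441):
    fade_out = np.linspace(1.0, 0.0, overlap)
    fade_in = 1.0 - fade_out
    blended = a[-overlap:] * fade_out + b[:overlap] * fade_in
    return np.concatenate([a[:-overlap], blended, b[overlap:]])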

I have my doubts about constructing artificial data to test codecs on; it stops being a 'real world' test if you don't even base the material on real music.  Aside from that issue, I'd hate to think how complex creating a song with representative patterns of all music in it would be!


Quote
I also have a question about your first graphs (showing the results for selecting tracks and albums). If I understand correctly, you compressed quite a few tracks and then calculated what different combinations of those tracks would yield as a result. If that is the case I'd imagine that quite a lot of combinations would be possible (hundreds), probably all with slightly varying data sizes. So I suspect you grouped them together somehow, but my question is how did you group them, and what was your reason for doing it that way? (Or am I overlooking something?) I'm not very good at statistics, but I imagine it could influence the results if, for example, groups with large data sizes had more members.

I think if you have a particular distribution of numbers (the compression ratio of different songs) you should get the same value for the deviation no matter what sample size you use.  With a bigger sample size you get a better estimation of this deviation, but the number shouldn't rise or fall in a systematic way as you increase the sample size - just fluctuate about a central point.  Can anyone confirm or correct this?

The program creates a list of 500 compression ratios (if I have 500 songs), so the sample size of '1 track combinations' is 500.  It then pairs them up into combinations of 2 tracks, choosing every possible combination, so every possible data size and compression ratio is found.  Once you get to combinations of 3 tracks there are 20'000'000 ways, so it becomes far too slow to calculate every choice.  If the sample size is going to be over a million, it doesn't look at every combination, but picks out combinations at random the way a lottery machine picks a combination of 6 numbered balls.  It just picks a set of track id numbers, and again has no knowledge of the data size of a particular combination.  I think it should end up being fair, as it picks enough combinations (500'000) to remove the effect of some combinations having an unusually small or large data size.

I hope that explains its operation better to you?
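A rough sketch of that lottery-style selection, again assuming per-track original and compressed sizes are known; the threshold, sample count and names are illustrative, not taken from the actual program.

Code
# Ratio of k-track selections: exhaustive when feasible, otherwise a
# lottery-style random sample of combinations of track indices.
import math
import random
from itertools import combinations

def sample_selection_ratios(tracks, k, max_exhaustive=1_000_000, n_samples=500_000):
    """tracks: list of (original_bytes, compressed_bytes) pairs."""
    n = len(tracks)
    if math.comb(n, k) <= max_exhaustive:
        combos = combinations(range(n), k)
    else:
        # Like drawing k numbered balls: distinct track ids picked with no
        # knowledge of the data sizes involved.
        combos = (random.sample(range(n), k) for _ in range(n_samples))
    ratios = []
    for idx in combos:
        orig = sum(tracks[i][0] for i in idx)
        comp = sum(tracks[i][1] for i in idx)
        ratios.append(comp / orig)
    return ratios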

Attempt to improve lossless comparison tests

Reply #6
Quote
I think if you have a particular distribution of numbers (the compression ratio of different songs) you should get the same value for the deviation no matter what sample size you use.  With a bigger sample size you get a better estimation of this deviation, but the number shouldn't rise or fall in a systematic way as you increase the sample size - just fluctuate about a central point.  Can anyone confirm or correct this?
Sounds logical.
Quote
The program creates a list of 500 compression ratios (if I have 500 songs), so the sample size of '1 track combinations' is 500.  It then pairs them up into combinations of 2 tracks, choosing every possible combination, so every possible data size and compression ratio is found.  Once you get to combinations of 3 tracks there are 20'000'000 ways, so it becomes far too slow to calculate every choice.  If the sample size is going to be over a million, it doesn't look at every combination, but picks out combinations at random the way a lottery machine picks a combination of 6 numbered balls.  It just picks a set of track id numbers, and again has no knowledge of the data size of a particular combination.  I think it should end up being fair, as it picks enough combinations (500'000) to remove the effect of some combinations having an unusually small or large data size.

I hope that explains its operation better to you?
Yes, that seems like a reasonable way to do it.