Help - Search - Members - Calendar
Full Version: Attempt to improve lossless comparison tests
Hydrogenaudio Forums > Lossless Audio Compression > Lossless / Other Codecs
Mac
If you don't want to waste 15 minutes reading this post, here is the summary:
  • Lossless comparisons only consider a small number of songs due to time considerations
  • The compression ratio of any song can be estimated by a short sample of that song
  • Compressing short samples from a large number of songs might give a better estimate of a codecs performance than compressing a small number of complete songs
  • I've written a couple of programs to test this, along with the effectiveness of compressing tracks vs. albums
  • It could be better to compress a selection of individual songs rather than albums
  • It could be better still to compress munged files – many short chunks of different songs glued together – provided the effect on the compression ratio of ‘gluing’ the chunks together is known
  • Many more tests are probably needed to verify these things
Please note all of this is limited by my ability and knowledge, don't treat it (or me) as though I am claiming things as fact! Also, I apologise that it has the stuffy feel of a science report, I’m writing this as practice for (and a way of putting off) my next report for school..



The Background
I've been working on a little pet project on and off for a couple of months, born out of curiosity on how to read & write wave files, and the dire need to improve my statistics knowledge. Since first seeing Hans's lossless comparison I wondered how accurate these lossless tests could be, as in how well do they reflect my taste in music, and any music in general.

Twice I've wanted to archive some music away never to be needed again, and so grabbed a copy of LA to save a few hundred megabytes. Both times (on some home-made music, and Pink Flods Tree Full of Secrets) I found that APE's extra high mode beat LA's maximum mode. This probably isn't rare, but it raised the issue of how accurate Hans's test could be, and more importantly how confident people could be in the comparison.


The Problem
The problem as I see it is the narrow sample of songs chosen in this, and every other comparison out there. For example, if the album 'I Care Because You Do' was chosen in place of 'Richard D. James' wouldn't the overall ratios of the codecs be different in absolute value, and more importantly different relative to each other (e.g. APE possibly at a lower ratio than LA).

The obvious solution of encoding more albums to gain a better representation has to be ruled out. I was staggered at the patience of Hans, Speek & Guru to compress and decompress 5gb+ of data with 30 different options. So instead I began to investigate the idea of fitting more songs into that 5gb to give a wider scope on the compression ability.


The Proposition
The majority of songs could be pictured as a few sections comprised of a different short segment repeated (with slight variation) throughout. As an example, you could call the chorus of a song a 'section', and if you picked one bar out of it, you would have a short segment that represented the chorus. If you assume the compression ratio within each broad section of a song is pretty similar throughout, you could pick out a 5 second segment to represent the whole section.

I hope I don't lose people in the clumsy terminology there, the principal is that compressing a couple of 5 second chunks from any given song will hopefully give you a reasonable estimate of the compression ratio of that song. In this case the average song becomes 15 seconds long instead of 3 minutes – so you could get a reasonable approximation to 12 songs in the same space as the exact ratio for 1 song.

That is where the central question of this test and discussion lies; is the approximate ratio of a vast number of songs better than the exact ratio of a small selection of songs? By better, I mean closer to the exact ratio of the vast number of songs.


The Method
I started writing a program to review a large collection of songs (my entire music collection) and 'munge' them. By this I mean it would pick 5 second samples from the songs at random - the choice of which song and the position within each song was made at random.

After selecting a random chunk, the program 'glues' it on to the end of the last one to create a single file with all the chunks in a long chain. In order to minimise the impact of this on the codecs predictor, it attempts to match the sample value and gradient across boundaries. I chose not to crossfade or add new samples to smooth the transition as this would introduce material that wasn't in the original songs. Instead it scans around the vicinity of the chunks starting point and chooses the starting position that best matches the ending point of the previous chunk. When listening to a munged file there are no clicks due to the cut’n’paste that is going on, all the chunks are joined quite smoothly.

This presented an argument against such a method that I cannot really counter or prevent. The culling of >90% of the material in individual songs can be justified as you can include 10x more songs to give a wider selection. The real problem is that individual chunks aren't going to be surrounded by their original material, so the predictor of an encoder is going to be 'in the wrong state' at the start of every chunk (not in the same state as it would be in the actual song). Whether the error introduced by this would negate any improvements was one subject I tried to investigate with some tests.


The Tests
The 'traditional' practice of picking a selection of songs needs to be tested against the proposed idea of munging the whole collection into a summary. The best method I thought of to test these was to look at the different values of the compression ratio a different selection of input material would make. I saw this as a way of answering the question 'what if Hans chose a different Aphex Twin album' posed earlier. The standard deviation of a group of 'equivalent choices' or 'equivalent files' was found as a measure of this spread.

For the munged files this was simple but time consuming. Running the program a number of times with the same parameters will choose different chunks within the vast collection, as the choice of which positions to take a chunk from are chosen at random. This would give a set of equivalent files, which differ only due to the random selections.

To compare this to the range of compression ratios found from picking individual songs, I wrote a second program to run through all the possible combinations of tracks and albums that could be made from a collection. (e.g. If I have 5 songs, ABCDE, it will give the compression ratio of A, B, C, D, E, A+B, A+C, A+D .. A+B+C .. and so on). I chose to also test the constraint of one album from each genre as done in Hans's test, as this seems the most logical way of choosing a selection of albums, rather than end up with 3 metal and no classical music.

The effect of having artificial joins in the munged file was also tested by creating files with different chunk lengths, ranging from 0.1s up to 5s or so.


The Results
The tests I've made so far have been preliminary, and thus are pretty haphazard. The first set were made with ~1500 songs (90 albums) totalling about 60gb, and encoded with APE 3.97 extra high. This was culled down to ~500 songs to save time and disk space, and the files were converted to APE 3.99 high. The broad genres I grouped them into were electronic, hip hop, metal, drum'n'bass & other (classical + film scores). My collection was far from idyllic when looking at the compression ratio for 'general music', but it provided a mix of 40%-80% tunes to highlight differences in the results. (if all the tunes were 55-60% any discrepancies in methods would be tiny)

The results are in general looking at the accuracy and reproducibility of results. Accuracy being how close the chosen selection was to the true compression ratio - that of the entire collection as a whole. Reproducibility being how varied the result will be depending on what random choices are made.

For example, picking one song and compressing that could be very accurate if it happens to be 60%, or wildly inaccurate if it's a quiet classical piece compressing to 40% (it is 20% too low). If every possible choice is made (each song is looked at individually in turn) the median compression ratio could be reasonably close to the true value (if the songs are evenly spread about their mean, 100 above and 100 below) but the deviation will be huge, as you could be anywhere up to 20% out based on your choice.

I considered the 'efficiency' of a selection to be how good the result was compared the amount of data you had to compress to get it. If you can get a very close to the correct value with a small spread by only compressing 1% of the entire music collection your selection is very efficient, whereas compressing 5x as much data and getting the same results would be less efficient.


Test of standard methods
Possibly the most immediately useful results concern the methods currently employed in lossless tests. A comparison of a random selection of tracks, albums, and '1 album per genre' was tested. In contradiction to the method of current tests, grouping tracks together into their respective albums is less efficient than picking individual tracks at random. It was also noted that shaping the choice of albums by picking one of each genre gave a very different value for compression ratio, although this was expected for reasons described later.

This would imply that any semblance of grouping tracks together will harm the results of a test. I guess the rationale behind this is that by breaking the music collection into the smallest possible unit (individual tracks rather than albums), a wider number of choices is gained (50 tracks instead of 5 albums). By randomly selecting the material to compress, a fair representation of the collection is gained, and by giving a larger scope for random selection the test is improved.

Some graphs could save me writing more, so these plot the compression ratio and standard deviation of the different selections (tracks, albums, genres) against the amount of data that was compressed for each selection. e.g. Hans's test is based on 7 albums, so that would be roughly 4.5gb, Speek's is based on 10 albums (6.5gb), and a test based on 20 individual tracks would need something like 1gb.

user posted image
The standard deviation for a number of tracks is notably smaller than that for the similar sized number of albums; e.g. 120 tracks (4.85gb) gave 60.6% ± 1.0%, 7 albums (4.86gb) gave 60.5% ± 3.4%.

My interpretation of this is that in Hans’ test, if APE extra high is at 59% and LA extra high at 57%, it isn’t entirely unreasonable that a different selection of albums would result in APE getting 56% and LA 60% - or APE 62% and LA 54%. If 120 tracks were compressed instead of 7 albums, whatever values APE and LA took would would likely be closer to their general averages over all music.

user posted image
With a finer scale, you can see that one album per genre is seriously off the mark, and that 1gb of individual songs are compressed on average closer to the true value than the combination of 1gb worth of albums. Neither of these points are particularly relevant as the standard deviation in all values is so large that they all cross the line of true compression ratio.

The first point would indicate that it is better to choose albums completely at random, rather than place a constraint of representing each genre by one album. This merely replaces your choice in choosing a balanced selection of albums with having to create a balanced collection in the first place. A random selection of albums from my collection of drum'n'bass and metal will mean nothing to someone wanting to compress classical, jazz, or wonderous 80's music.


The 90 albums in APE 397 show a very similar pattern, the standard deviation of songs is a little under half that of the same amount (in terms of data) of albums. 410 songs (15.3gb) gave 59.9% ± 0.8%, 25 albums (15.4gb) gave 59.9% ± 1.8%. The tracks did not converge to the correct value faster than albums as with the 399 test, and it was seen that with samples of over 10gb they gave almost identical ratios following the same trend.


Test of munged files
Once all the known problems with the munging program had been ironed out, it was used for some basic checks to see if things performed as should be expected. Six groups of ‘equivalent’ files were created, all containing 0.25s chunks (to speed up the execution), but the number of chunks in each file altered. The first group contained 320 files with 100 chunks in each, the second group had 160 files with 200 chunks in each and so on, until the last group containing 10 files with 3200 chunks. As I had expected, the standard deviation of each of these groups went as root(N) / N, where N is the number of chunks in each file. The average compression of each group of files was the same, ranging from 61.5% ± 1.2% up to 61.7% ± 0.2%.

Two things were quite evident at this early stage – the compression ratio was way off target, but the deviance in results was very small. The 100 chunk files lasted only 25 seconds, yet had a deviance of just 1.2%. In comparison, by choosing a single song lasting maybe 3 minutes, one would expect a deviance of 9.7%. The 3200 chunk files were around 100mb each, and with a deviance of just 0.2% showed a greater uniformity than combinations of 500 songs (0.7%) which required over 18gb to be compressed. This was clearly offset by the error in the compression ratio, which was over 1% too high, and so only the 100 chunk files could be argued to be of the correct ratio given their standard deviation.
user posted image


This error in ratio was tested more fully in the hope that the ratio could be corrected for any given test. Without some way of correcting for the error – or constructing files which do not induce an error – the results would be meaningless due to the large discrepancy between the values returned and the true ratio.

Several groups of files were produced with chunk times varying from 50ms up to 6s over the past few weeks. These contained files from both the 397 extra high collection and the smaller 399 high group. The reason behind combining the two groups was in hope of producing a universal formula for correcting the ratio – rather than requiring a specific correction for each compression mode used.

The first test was with the large 397 collection, where groups of 20 files were produced, each containing 0.1% (56mb) of the entire collection. The error in the compression ratio (average ratio of the group minus the true ratio of the whole collection) was found along with the deviation within the group. To be consistent with the 2nd test, the error in the compression ratio was represented as a fraction of the true ratio, so a 1% error would mean the ratio ended up 101% of what it should be, or 60.6% instead of 60% in absolute terms. This was hoped to be the method that different compression settings (and perhaps different encoders) could be compared.

The graph below shows this fractional error in the ratio against the chunk length, the best fit line is proportional to sqrt(t)/t, where t is the chunk time on the x axis.
user posted image


The biggest problem with this graph was the uncertainty in the long chunk times. The deviation in the files was dependant on the number of chunks present, and so got larger with the square root of chunk time. To help counter this, the second test with the 399 group produced files with a constant number of chunks (2000), rather than a constant length. This posed a different problem, the time and space required to create these files. It was extended to produce chunks up to 6s in length to test the longer term behaviour, 2000 of these chunks would form a 3h20 track. 10 copies of each file were produced to form a group.
user posted image


The coefficient of the two fits was similar (9.5 / 1000sqrt(t) and 8.8 / 1000sqrt(t) – both ±2ish) so they were tested as a combined group. The graph below summarises the compression ratio & deviance of these two tests (red,blue) and a 3rd with the 399 set (green). It is a bit of a mess as there is a lot of data, much of which overlaps. For example, 1 second chunks were tested both on APE 397 and 399, and a number of times in a subsequent test. This gives several data points lying in the same position.

user posted image

The 3 data sets appear to converge to a common line given their quite large errors. The power of the fit was allowed to be varied for the fit, and settled on -0.496, implying that sqrt(t)/t was the likely to be the correct assumption.

When plotted against t^-0.5 rather than t:
user posted image

From the combined best fit, the coefficient giving the error in compression ratio (x in ratio * x / 1000sqrt(t)) is 8.3 ± 0.8. The uncertainty in the value is due to the uncertainty on each group of data points.



The final test was to create and compress a set of munge files, predict the error in them and use the corrected values to estimate the compression ratio of the entire collection – analogous to the track & album tests earlier.

Sets of 10 files were created containing 0.5%, 1.0%, 1.5% and 2.0% of the collection, with chunk times of 0.1s and 1.0s – giving 80 files in total. Each set was grouped and the standard deviation found, the correction to the ratio applied and then they were plotted as before

The light blue points show the uncorrected ratio, with the darker points showing the corrected value with the standard deviation. For 1 second chunks:
user posted image

And 0.1 second chunks:
user posted image

It is worthwhile scanning back to the graphs of track & album comparisons. Up to about 2gb of data, the standard deviations were around ± 5%. The deviation doesn’t go above 0.4% for 1 second chunks, nor above 0.2% for 0.1 second chunks – even when the munged files contain less than 100mb of data.

One notable omission here is neglecting to include the uncertainty in the correction factor. If my differentiation is up to scratch, this ended up giving the standard deviation as 5% for 1 second, and 18% for 0.1 second chunks! I don’t trust this result particularly, both for the large error in the correction factor and the way this converts into error on corrected ratio. Either way I would expect the error on the correction factor to improve significantly when more tests were made.

I think this method of munging files could be promising, although the final test was very brief, the correct values were obtained. I would like to think it is worth testing further.


The Discussion
So, what next? This rests on the general answer to three questions I am left with.
  • Is anyone actually interested in this?
  • Have I made any horrific errors, either in my code or analysis?
  • Is anybody willing to help continue this research?
This could come across as a completely esoteric and useless concept, I merely approached it out of a strange curiosity. I can see a use for it, albeit an incredibly narrow and limited one, hence it is hard to think of this as important or ground-breaking!

One potential benefit I considered could come from the use of these munge files; the ability to pool individual music collections together to produce a better ‘representative collection’ based on several peoples tastes. If I munged my music collection with the same settings (% coverage & chunk time) as someone else, the two munged files could be joined together. I believe the result would be the same as if the two collections were bought together then munged. Munged files should be perfectly legal to pass around, being as no more than a fraction of a second of any song is present at any point!


If any level of interest is shown by people here, I be interested to see anywhere I have fouled up. Should other people ever want to use the programs I've written I could sorely use some assistance in that area. I couldn't get the FLAC API working for the life of me, so the munging app will only play with wave files, meaning all the compression must be done manually.


Again if there is interest, I wouldn’t mind testing everything further. The main aim I see would be to test more situations – different musical input and different codecs. If the correction factor proves the same for FLAC and Wavpack as it is for APE, I would be impressed. Also, just some discussion about this would be interesting, to see what people think of this (other than the possiblity I have gone insane) smile.gif




I'm sure that is all for now. Data, programs and source code can be made available to anyone interested. I should probably clean the profanity out of them first though, so ask if you want. Thanks for reading smile.gif
xmixahlx
so i'm confused as to the focus of all this...

music/sound/noise is dynamic and will resulting data will obviously change as the input changes... so the goal of benchmark style teseting is to include an exhaustive amount of data as to make a resulting deviance minimal or not important.

within the context of benchmark testing, codecs must have a resulting value - i think you are confusing percentage (compression ratio) with a static value rather than a value generated using dynamic input.

also, why test a codec's ability to compress random/frankenstein noise samples when that codec will not be used for that purpose? current unofficial benchmarking goes above and beyond what anyone needs to know as far as performance - but feature comparison could be expanded upon to make a better educated decision.

btw, "ruling out" obvious solutions is a bad practice, imho tongue.gif


just my thoughts, anyways...

[edit:typo]
bawjaws
Slightly off at a tangent: I'd like to see a comparison done on a PPC machine to see which codecs take advantage of Altivec.

Nice graphs by the way.

edit: Talking of which, it would be nice to see a version of Hans' graph that didn't treat compression + speed performance as a point, but as a cluster of points for each tested audio file to give an idea of which codecs where consistent and which had a wide range of performance characteristics.
jcoalson
I think the "munging" idea is good for getting more representative numbers (as long as the slices aren't too short; at least 10 secs is probably OK as I don't think any of the codecs have long-term prediction). the munger is definitely a useful tool. but even random selection from a corpus will still reflect the overall corpus bias. the only way around this is to have as large and varied a corpus as possible.

but then you will only get an "average" number. the measures that will be more relevant to people will still be genre-based and format-based (by format-based I mean CD-DA (16/44.1), 24/96 audio, multichannel audio, etc). it is among formats and genres that the codecs tend to differ more.

Josh
Jasper
Very interesting idea! Could potentially be very useful for getting better test results faster.
I think it would also be interesting to see the difference between this method and selecting albums/tracks per genre.

As for faults in your reasoning/testing, the only thing I could find that definitely needs a bit of looking into is the uncertainty in the correction factor. Ideally it should be minimised as much as possible, perhaps by choosing longer sections, by modifying the way you join sections, or even changing the way a file is munged. One thing you may especially want to look into is the way you join sections, apparently you do try to make the transition not too abrupt, but you may want to make it even less abrupt, as the sudden change in content will be relatively hard to encode for most lossless compressors. And you shouldn't worry about introducing something that wasn't there before too much, as you're doing that already anyway (the transition), you just need to find a way to minimize the effect it has. Theoretically it might even be possible to simply generate a short piece of sound that just has properties similar to the original, but isn't even composed of actual sections from the original (based on frequency data for example).

I also have a question about your first graphs (showing the results for selecting tracks and albums). If I understand correctly you compressed quite a few tracks and then calculated what different combinations of those tracks would yield as a result, if this is the case I'd imagine that quite a lot of combinations would be possible (hundreds), probably all with slightly varying data sizes. So I suspect you grouped them together somehow, but my question is how did you group them and what was your reason for doing it that way? (Or am I overlooking something?) I'm not very good at statistics, but I imagine it could influence the results if, for example, groups with large data sizes had more members.

I'd definitely like to help you look into this further btw. For one thing, I have working code for using FLAC (and commandline encoders), as well as a tool to encode a file without actually writing the encoded file (it just keeps track of its size).
Mac
QUOTE
music/sound/noise is dynamic and will resulting data will obviously change as the input changes... so the goal of benchmark style teseting is to include an exhaustive amount of data as to make a resulting deviance minimal or not important.

within the context of benchmark testing, codecs must have a resulting value - i think you are confusing percentage (compression ratio) with a static value rather than a value generated using dynamic input.
*

My argument is that by compressing a hundred 10 second samples you have a selection of music with lower deviance than by compressing five full songs. If we somehow had access to every hip hop song made and found that most would compress to between 50% and 56%, a comparison test that gave the single value of 53% would be good in my opinion. If a different test which used a small sample gave 55%, this isn't such a good representation of that genre.

QUOTE(xmixahlx @ Jan 4 2005, 09:29 PM)
also, why test a codec's ability to compress random/frankenstein noise samples when that codec will not be used for that purpose?
*

Frankenstien samples is a good way to describe them, but the aim is to make ones aren't just random noise, they should represent the actual music people compress very well.


QUOTE(jcoalson @ Jan 5 2005, 06:12 PM)
the munger is definitely a useful tool. but even random selection from a corpus will still reflect the overall corpus bias. the only way around this is to have as large and varied a corpus as possible.

but then you will only get an "average" number.  the measures that will be more relevant to people will still be genre-based and format-based ...  it is among formats and genres that the codecs tend to differ more.
*

I would like to see a number of corpus' (corpii?) to reflect the most common genres and produce the kind of test bawjaws mentions. Perhaps being able to collaberate with a number of people to create these based off several thousand songs would help create 'well rounded' files to base the test on.


QUOTE(Jasper @ Jan 6 2005, 12:01 PM)
As for faults in your reasoning/testing, the only thing I could find that definitely needs a bit of looking into is the uncertainty in the correction factor. Ideally it should be minimised as much as possible, perhaps by choosing longer sections, by modifying the way you join sections, or even changing the way a file is munged.
...
Theoretically it might even be possible to simply generate a short piece of sound that just has properties similar to the original, but isn't even composed of actual sections from the original (based on frequency data for example).
*

Crossfading samples was something I wanted to try next, it might help to ease the codec into the next sample and improve the error. One thing I have tested was to select all the random positions first, then sort them into 'ascending order' in time. If the collection is ordered into genre and then into artist, this would place all the chunks of the same artist together, and if there are two chunks from a given song, they will be next to eachother. This only has a noticable effect with short chunk length when there are 2 or more chunks per song on average.

I have my doubts about constructing artificial data to test codecs on, it stops being a 'real world' test if you don't even base the material on real music. Aside from that issue, I'd hate to think how complex creating a song that has representative patterns of all music in it!


QUOTE(Jasper @ Jan 6 2005, 12:01 PM)
I also have a question about your first graphs (showing the results for selecting tracks and albums). If I understand correctly you compressed quite a few tracks and then calculated what different combinations of those tracks would yield as a result, if this is the case I'd imagine that quite a lot of combinations would be possible (hundreds), probably all with slightly varying data sizes. So I suspect you grouped them together somehow, but my question is how did you group them and what was your reason for doing it that way? (Or am I overlooking something?) I'm not very good at statistics, but I imagine it could influence the results if, for example, groups with large data sizes had more members.
*

I think if you have a particular distribution of numbers (the compression ratio of different songs) you should get the same value for the deviation no matter what sample size you use. With a bigger sample size you get a better estimation of this deviation, but the number shouldn't rise or fall in a systematic way as you increase the sample size - just fluctuate about a central point. Can anyone confirm or correct this? smile.gif

The program creates a list of 500 compression ratios (if I have 500 songs), so the sample size of '1 track combinations' is 500. It then pairs them up into combinations of 2 tracks, which it would choose every possible combination for, so every possible data size and compression ratio is found. Once you get to looking at combinations of 3 tracks, there are 20'000'000 ways, so it becomes far too slow to calculate every choice. If the sample size is going to be over a million, it doesn't look at every combination, but picks out combinations at random the way a lottery machine picks a combination of 6 numbered balls. It just picks a set of track id numbers, and again has no knowledge about the data size in a particular combination. I think it should end up as being fair, as it picks enough combinations (500'000) to remove the effect of some combinations having an unusually small or large data size.

I hope that explains its operation better to you?
Jasper
QUOTE(Mac @ Jan 7 2005, 06:56 PM)
I think if you have a particular distribution of numbers (the compression ratio of different songs) you should get the same value for the deviation no matter what sample size you use.  With a bigger sample size you get a better estimation of this deviation, but the number shouldn't rise or fall in a systematic way as you increase the sample size - just fluctuate about a central point.  Can anyone confirm or correct this? smile.gif
Sounds logical.
QUOTE
The program creates a list of 500 compression ratios (if I have 500 songs), so the sample size of '1 track combinations' is 500.  It then pairs them up into combinations of 2 tracks, which it would choose every possible combination for, so every possible data size and compression ratio is found.  Once you get to looking at combinations of 3 tracks, there are 20'000'000 ways, so it becomes far too slow to calculate every choice. If the sample size is going to be over a million, it doesn't look at every combination, but picks out combinations at random the way a lottery machine picks a combination of 6 numbered balls.  It just picks a set of track id numbers, and again has no knowledge about the data size in a particular combination.  I think it should end up as being fair, as it picks enough combinations (500'000) to remove the effect of some combinations having an unusually small or large data size.

I hope that explains its operation better to you?
*
Yes, that seems like a reasonable way to do it.
This is a "lo-fi" version of our main content. To view the full version with more information, formatting and images, please click here.
Invision Power Board © 2001-2008 Invision Power Services, Inc.