AccurateRip - Future Direction

It has been brought to my attention that the CRC used in AccurateRip is not doing its job properly. In layman's terms, the right channel rolls out of the CRC calculation every 1.5 seconds (that is, at the 1st sample the right channel is used 100%, by the 65,535th sample it is not used, at the 65,536th sample it is used 100% again, and this repeats over and over). It is estimated that effectively 3% of the data is not getting into the CRC (even at 97% coverage, I stand behind AccurateRip: it is better than most (? all) C2 implementations). Going back over the early AccurateRip code it seems the design of the CRC is fine, just not the implementation (L and R channels were supposed to go in separately, but were optimized to both go in without bringing down the upper 32 bits).

Steve will post his detailed findings on this discovery.

It is a relatively easy fix (detailed below); however, this presents an opportunity which was not around when AccurateRip was first implemented (the understanding of different CD pressings and how they were implemented was almost non-existent).

----------------------------
1. Fix: Fix the algorithm so all the data is used. Both new and old CRCs are calculated, the new one checked first, the old one second (with less accuracy). New submissions would effectively appear as different pressings in the database.
----------------------------
2. Fix: Change the CRC algorithm to something like CRC32. The reason it was not used in the first place was that tracks 2 to x-1 would match the CRC presented in EAC, but the first and last tracks never would, causing confusion; the CRC could be XOR'd to avoid this confusion.
----------------------------
3. Fix & Additional Development: Use CRC32 and the old CRC (there is lots of data in the database); the new CRC32 would go into a parallel 2nd database, increasing the strength of the CRC to almost 64 bits (not taking into account the flaw). On the back end there are few changes to be made; both databases are the same design.
----------------------------
4. Fix & Additional Development: Use a different hash (MD5, SHA-1); these would increase the storage of the database by up to 5x (160 bits for SHA-1).
----------------------------
5. Brainstorm a method of having a hash which would be resistant to pressings, yet still be feasible for a CD ripper working track by track rather than on the whole CD (and without the need to read outside of the track).
----------------------------
6. ???

Bear in mind the existing database before construction takes up some 14 GB.

AccurateRip - Future Direction

Reply #1
My information about the implementation of the Accurate Rip CRC algorithm derives from Christopher Key's ARcue.pl Perl script available at
http://www.hydrogenaudio.org/forums/index....showtopic=53583

For each incoming data sample, the Right and Left 16-bit info is grouped into a single 32 bit word.  That word is multiplied by another 32 bit number, called frame_offset in the code, which is really just the sample's address in the file, i.e. the sample number.

All that's done is to multiply the two numbers together.  This results in a 64 bit product of which only the bottom 32 bits are preserved when loading it back into the $CRC integer variable.  A running total of the sample times its address is kept to produce the final CRC.

This algorithm is not a CRC at all, but a checksum, and a badly implemented one.  The problem is that the multiply shifts the high order bits of the sample out of the 32 bit window that is stored in $CRC and never rotates that data back into the low order bits.  It's really only half of a checksum.  This means that many of the high order bits of the right channel data do not participate at all in the checksum calculation.
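
A minimal sketch of that per-sample update in Python (my reading of the description above, not the exact ARcue.pl code; I am assuming the right channel sits in the high 16 bits of the word and that positions are 1-based):

    def arcs_flawed(samples):
        # samples is a list of (left, right) 16-bit pairs for one track
        crc = 0
        for i, (left, right) in enumerate(samples, start=1):
            word = ((right & 0xFFFF) << 16) | (left & 0xFFFF)  # assumed channel order
            crc = (crc + word * i) & 0xFFFFFFFF  # only the low 32 bits of the product survive
        return crc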

In more detail, let D represent the 32 bit data word, and A (for Address) represent the data word's position in the file.  The calculation is that the CRC increment will equal D * A.  Now let's partition D into a high order m bit portion DH and a low order (32 - m) bit portion DL.  We'll reverse partition A into a high order 32 - m bits AH and a low order m bits AL.


D = ---DH---|----DL------- = DH * 2^(32-m) + DL
A = ------AH-----|---AL---- = AH * 2^m    + AL

Now multiply D times A.  The result is:

(DH * 2^(32-m) + DL) * (AH * 2^m + AL)  =  (DH * AH) * 2^32 + (DH * AL) * 2^(32-m) + (DL * AH) * 2^m + DL * AL

Take the low order 32 bits of the above product and the first term disappears, leaving

Low_32-bits_of ( (DH * AL) * 2^(32-m) + (DL * AH) * 2^m + DL * AL )

Now what happens if the m low-order bits of A are zero, meaning that AL itself is zero?  The product then becomes

Low_32-bits_of (DL * AH * 2^m)

and the result is insensitive to all m bits of DH, which no longer appear in the result.
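
A quick numeric check of that claim (illustrative values only):

    A = 0x12350           # the low m = 4 bits of A are zero (AL = 0)
    D = 0x89ABCDEF
    D2 = D ^ 0xF0000000   # flip the top m = 4 bits of D (DH)
    print((D * A) & 0xFFFFFFFF == (D2 * A) & 0xFFFFFFFF)  # True: DH never reaches the low 32 bits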

So how does this manifest itself? Every other sample has an even value of A, so that AL = 0 for m = 1, and the high-order bit of D is shifted out of the calculation.  Every fourth sample has the two LSBs of A = 0, making m = 2, shifting two MSBs of D out of consideration.  Every eighth sample has 3 MSBs shifted into oblivion, and so on.  Every 2^16 = 65,536 samples, the entire 16 bits of the right channel of that sample are shifted away, meaning it has no effect at all on the final checksum.  This happens once every 65536 / 44100 = ~1.5 seconds.

Adding all these missed bits together, we find that about 3% of the bits in the file do not participate in the final Accurate Rip CRC.  This is a coverage of only 97%, where a properly designed 32 bit CRC, or even a simpler checksum for that matter, would give 99.99999997%!
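
The 3% figure can be sanity-checked: a sample whose index has k trailing zero bits loses its top k bits (up to all 16 of the right channel), and averaged over all indices that is roughly 1 lost bit out of every 32. A few lines of Python (a rough model of the loss described above, not the AR code) confirm it:

    N = 2**16  # one full cycle of the pattern, about 1.5 seconds of audio
    lost = sum(min((i & -i).bit_length() - 1, 16) for i in range(1, N + 1))
    print(lost / (32 * N))  # ~0.031, i.e. about 3% of the bits never reach the checksum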

This low coverage may explain some of the problems that people are reporting where their ripping software is indicating errors and Accurate Rip says that everything is fine.

So what to do next?  Choose an algorithm that at least has proper coverage given 32 bits, such as CRC-32.  Possibly go to a longer CRC such as CRC-64 or even to a 128 bit hash like MD5.  The now obsolete cryptographic properties of MD5 are not relevant here.  It would make a fine "CRC" with a ridiculously high coverage percentage.

See Wikipedia for more details:
http://en.wikipedia.org/wiki/Cyclic_redundancy_check
http://en.wikipedia.org/wiki/MD5
http://en.wikipedia.org/wiki/Cryptographic_hash_function

AccurateRip - Future Direction

Reply #2
Reading Spoon's initial post to this thread I see that the intention was to calculate left and right channel checksums independently, so that the product of two 16 bit numbers would fit in a 32 bit integer, whose upper sixteen bits could then be shifted down and added to the lower to make a correct checksum with full coverage.

So this is really a code optimization bug.
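
A sketch of how that per-channel scheme could look (my interpretation of the intended design, not Spoon's original code; I assume 1-based positions and that any product bits above 32 are folded back into the running 32-bit sum):

    def arcs_per_channel(channel):
        # channel is a list of 16-bit samples from one channel (left or right)
        crc = 0
        for i, s in enumerate(channel, start=1):
            p = (s & 0xFFFF) * i
            crc = (crc + (p & 0xFFFFFFFF) + (p >> 32)) & 0xFFFFFFFF  # nothing is discarded
        return crc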

It's really a shame that C2 is done so poorly in most CD-ROM readers, and I agree that even 97% coverage is probably better than most C2 implementations, but we can get to "nine nines" of coverage by fixing the bug.

AccurateRip - Future Direction

Reply #3
Concerning Spoon's point #5: Considering that most different pressings are really bit-identical except for a few samples at the beginning and the end of the CD, due to a different offset in production, wouldn't rolling hashes help to find the right offset from which the checksum should be calculated? Something similar to what rsync does. Then the new AR DLL would calculate two track CRCs: the one with the "corrected offset" to match a pressing already in the database, and the one without offset correction ("corrected" doesn't mean read offset correction here, of course).

The required rolling hashes for the first samples of a CD could be added to the existing database whenever Track 1 of that CD is successfully ripped with the new version of AR by a user.
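
For illustration, a rolling checksum over a fixed window can be slid one sample at a time in constant work, so every candidate pressing offset is cheap to try (a generic sketch of the rsync-style idea, not a concrete AR proposal):

    def rolling_sums(samples, window):
        # yields (offset, checksum) for every window position; each slide is O(1)
        s = sum(samples[:window]) & 0xFFFFFFFF
        yield 0, s
        for i in range(window, len(samples)):
            s = (s - samples[i - window] + samples[i]) & 0xFFFFFFFF
            yield i - window + 1, s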

AccurateRip - Future Direction

Reply #4
I personally vote for using the MD5 hash of the PCM data.  The properties of MD5 are well known, and as Steve Gabriel points out the cryptographic weaknesses of the algorithm are irrelevant in this case.  There are many reference implementations and there are test suites for verification.  It seems silly to reinvent the wheel when so many excellent hashing and CRC algorithms exist.

AccurateRip - Future Direction

Reply #5
I lean toward option 4, using MD5 as well, but I wonder about the database expansion factor of 4x.  You said the database is now 14 GB, yet the Accurate Rip website says there are "only" about 460K CDs.  That's maybe 6 million tracks, which means that each track consumes 2.3 KB in the database.  Is that right?  What portion of the DB is devoted to actual CRCs?

Also, how prevalent is the multiple-pressing problem?  On a typical CD, how many variants are there?  Is the most common way a variant presents itself a simple offset of all the samples, as if there were a write offset error at the production facility and all the samples are otherwise identical?

AccurateRip - Future Direction

Reply #6
In the DB are stored: computer ident (a big long OLE-type identifier), CRC of the track + CRC for offset finding. I used to be able to construct the final database using huge (GBs) RAM disks, which was nice and fast; it has grown beyond that point and it now takes some 24-48 hours to construct the final database.

Pressings are a big problem for AR (the whole CD is shifted by a certain number of samples, perhaps 1000 in some examples); they give wrong offsets when finding a drive's offset (even with 3 discs, people can still get the wrong offset, say 1 in 3000). For common drives with a strongly known offset, AR will only key to the known offset.

About database sizes: 14 GB now; in 2-3 years, if things were left as they are, it could be 40 GB; if the plans for '6.' (see below) were implemented, that would be 80 GB.

(have a cup of coffee before reading this...)

6. I think I have the solution! As it stands, the database holds for each track (forget pressings for the moment) a track CRC (which has the flaw) and an offset-finding CRC (which does not have the flaw).

I will be talking about 2 databases side by side: the existing database is DB1 and the new one is DB2.

[DB1] Work should be done in EAC and dBpoweramp ASAP to correct the flaw; each program should calculate 2 CRCs, the old one and the new one. Only the new one should be submitted once the fix is implemented. The old CRCs would in time be replaced by the new CRCs in the same database.

[DB2] In addition, 2x CRC32s should be generated:

[CRC1][..............CRC2............][CRC1]

So CRC1 covers, say, the first 5 frames and last 5 frames of the track, and CRC2 covers the whole track. These 2 CRCs could be submitted to a 2nd database, where CRC1 goes into the current offset-finding slot, so there are no changes on the backend (apart from creating the 2nd database)!

Why do this? It would allow a match if your CD is a different pressing and not really in the database. No rolling CRCs are needed, as the CRC from the existing database that is used to find drive offsets can find the offset of the pressing, and as long as it is within +/- 5 frames, the pressing can be verified. It also has the benefit that for track 1 (which currently is only calculated from 5 frames in), any drive with a + offset would have the correct CRC1, so all of track 1 could be verified in its entirety (not possible for the last track, as the majority of drives cannot overread).
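
A sketch of how a ripper might compute that pair (my reading of the scheme above; I'm using zlib's standard CRC-32 and assuming raw 16-bit stereo PCM, 588 samples = 2352 bytes per frame):

    import zlib

    FRAME_BYTES = 588 * 4   # one CD frame of stereo 16-bit audio
    EDGE = 5 * FRAME_BYTES  # 5 frames = 2940 samples

    def db2_checksums(track_pcm):
        # track_pcm: the track's raw PCM as bytes
        crc1 = zlib.crc32(track_pcm[:EDGE] + track_pcm[-EDGE:]) & 0xFFFFFFFF  # first 5 + last 5 frames
        crc2 = zlib.crc32(track_pcm) & 0xFFFFFFFF                             # the whole track
        return crc1, crc2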

When I started AccurateRip, the idea of pressings altering the audio data was not known (to me). If you had 40 different pressings of the same CD (possible with worldwide releases over 10 years), that lowers the 1 in 4 billion of a working 32-bit CRC routine to about a 1 in 100 million chance of a CRC clash; adding the 2nd CRC would effectively boost the CRC to 64 bits. Then AccurateRip could return:

Match using old CRC method,
Partial pressing match (10 frames of the file missing)
Match using CRC fix method (32 bit), in addition CRC32 match (on CRC1 and CRC2, so whole track)

All that would need to be done is a method of showing which of the above to the end user.
Changing to MD5 would mean the whole backend being rewritten, and there is about 30x more code on the backend - to keep the database clean from rogue data, such as drives configured with wrong offsets.

AccurateRip - Future Direction

Reply #7
Hi Spoon,

great to see you asking for features or suggestions. 

AR is an excellent tool to make sure the rips are done well, but I see a few problems with the way submissions are handled. It's possible you already implemented remedies for these without me noticing, but here we go...:

"Confidence 1"

I often find myself ripping a perfect CD on my (of course...) perfect system, with all tracks ripping at 100% in EAC secure mode Test & Copy, but in the end only 50% of the tracks are "accurately ripped (confidence 1)" and the other 50% are not matching. It feels like someone submitted these results while ripping in a suboptimal fashion, thereby tainting the database with lousy AR checksums. Are all the other, better AR results not being registered? Is it a first come, first database entry thing?

This goes in a similar direction as the multiple pressings: why not save all submissions of one CD, rank them, and tell the user "accurate, confidence 6" or "not accurate, confidence 1", i.e. there are 6 others with apparently the same rip result, while there is one submission that reported a different CRC.  This could go in addition to your idea of creating separate sums for the main part of a track and the first and last 5 frames.


Different Pressings

Your idea sounds great, but I think we should definitely test it in order to figure out whether those 5 frames are enough. If you need some beta testing, let me know!

Thank you.

AccurateRip - Future Direction

Reply #8
>Are all the other, better AR results not being registered? Is it a first come - first database entry thing?

No, when someone verifies your rip results they would appear in the database.

>"accurate, confidence 6", "not accurate, confidence 1",

The confidence 1 is just from one of the pressings; it is likely your rip failed, as it did not match the confidence-6 result (which is in the database).

5 frames is 2940 samples; there might be CDs with a wider range, but it would be wrong to try to accommodate all the pressings out there (plus a pressing near to one just outside the range would verify it).

AccurateRip - Future Direction

Reply #9
This idea sounds pretty good, but I don't understand it fully yet.  How is the offset detection "CRC" computed?  How does it help in finding drive and pressing offsets?

First I think we need a terminology change.  CRC to me means specifically an algorithm based on polynomial division.  What you are calling a CRC is a type of checksum. I suggest calling your algorithm ARCS for Accurate Rip CheckSum. 

The algorithm names will be

ARCS-F  Accurate Rip CheckSum - Flawed
ARCS    Accurate Rip CheckSum - correct
ARO  AR Offset detection checksum (this is the part I don't understand yet)
CRC32  An actual polynomial division based CRC such as CRC-32 IEEE 802.3

ARCS-F adds together increments that are

Low_32_bits_of(sample * i)

ARCS correct should add

Low_32_bits_of(sample * i) + High_32_bits_of(sample * i), where High_32_bits_of(x) = (x >> 32).
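
In Python, the two update rules side by side (a sketch matching the formulas above, assuming 32-bit sample words and 1-based indices; not the actual AccurateRip source):

    MASK32 = 0xFFFFFFFF

    def arcs_f(words):
        crc = 0
        for i, w in enumerate(words, start=1):
            crc = (crc + (w * i & MASK32)) & MASK32            # flawed: high bits of the product are lost
        return crc

    def arcs(words):
        crc = 0
        for i, w in enumerate(words, start=1):
            p = w * i
            crc = (crc + (p & MASK32) + (p >> 32)) & MASK32    # corrected: fold the high 32 bits back in
        return crc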

Let's name the slots in the two DBs:

DB1-TC  Data Base 1 Track data Checksum
DB1-OD  Data Base 1 Offset detection Checksum
DB2-TC
DB2-OD

So what I think you're proposing is (I've edited this multiple times, sorry if confusing)

DB1-TC currently gets ARCS-F of full track data minus 5 leading frames if a first track and minus 5 ending frames if a last track.  Slowly replace with either ARCS correct or CRC-32 of same data?
DB1-OD gets ARO (maybe the same as ARCS, don't know how this works)
DB2-TC gets CRC32 of track data, always minus 5 leading and 5 ending frames, or do you really mean ARCS of that data?
DB2-OD gets either CRC32 or ARCS or something else? based on the 5 leading and 5 ending frames

AccurateRip - Future Direction

Reply #10
The offset detection is simply a CRC (checksum) of 1 frame in the first track; the program can hunt for the right offset.
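
For illustration, the hunt could look something like this (a sketch only; checksum_frame stands in for whatever checksum AR actually stores for that frame, and the window size and search range are assumptions):

    def checksum_frame(words):
        # placeholder single-frame checksum; the real AR definition is not given here
        return sum((w * i) & 0xFFFFFFFF for i, w in enumerate(words, start=1)) & 0xFFFFFFFF

    def find_offset(words, stored, frame_start, frame_len=588, search=2940):
        # try every shift within +/- `search` samples until the stored database value matches
        for offset in range(-search, search + 1):
            s = frame_start + offset
            if 0 <= s and s + frame_len <= len(words):
                if checksum_frame(words[s:s + frame_len]) == stored:
                    return offset
        return None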

> Slowly replace with either ARCS correct or CRC-32 of same data?

Correct ARCS.

>DB2-TC gets CRC32 of track data, always minus 5 leading and 5 ending frames, or do you really mean ARCS of that data?

CRC32

>DB2-OD gets either CRC32 or ARCS or something else? based on the 5 leading and 5 ending frames

CRC32, correct.

AccurateRip - Future Direction

Reply #11
The offset detection is simply a CRC (checksum) of 1 frame in the first track; the program can hunt for the right offset.

So each track entry in the database has the ARCS of a single frame of the first track on the CD?  AR scans all offsets +/- about 2000 of the first track data through that frame window to find a match, to ensure that the read offset is set right?  You are also saying that the current contents of the DB1-OD field were computed with ARCS and not ARCS-F, so that only DB1-TC has the bug?  Which frame number of track 1 did you pick?
Quote
>DB2-OD gets either CRC32 or ARCS or something else? based on the 5 leading and 5 ending frames
CRC32, correct.

So now you slide data through a 10 frame window (5 leading and 5 trailing) looking for a pressing match?

I assume the pressing scan has to do something different for a first or last track, such as only calculate DB2-OD on a leading or trailing 5 frames.

AccurateRip - Future Direction

Reply #12
[DB2] In addition, 2x CRC32s should be generated:

[CRC1][..............CRC2............][CRC1]

So CRC1 covers, say, the first 5 frames and last 5 frames of the track, and CRC2 covers the whole track. These 2 CRCs could be submitted to a 2nd database, where CRC1 goes into the current offset-finding slot, so there are no changes on the backend (apart from creating the 2nd database)!

At first I wondered why you needed two CRC32s for DB2; you could just calculate CRC2 for the entire track.  But it's starting to dawn on me that you're trying to preserve the database logic exactly, so you need to calculate CRC2 for the middle frames only, just like in DB1.  You add on the missing ten frames with CRC1.  It's not used in the pressing scan at all; that can use the existing DB1-OD field like you said.

Sorry to be so dense.


AccurateRip - Future Direction

Reply #14
In the DB are stored: computer ident (a big long OLE-type identifier), CRC of the track + CRC for offset finding.

Given this record structure, you have 128 bits for the computer ID and 32 each for the checksums, so it seems that adding another checkword of 32 bits, or even 128 bits, only expands the size by less than 2x. Obviously I'm missing some info about what's really going on.

How many records are in the DB?  How long is a record?  There are 16 M tracks in the DB and its size is 14 GB, so that tells me that there are an average of 5 entries per track.  This is probably distributed as a "long tail", with many discs having only 1 entry (confidence 1) and a few popular ones with thousands of submissions.

Just out of curiosity, do you have any easily available statistics on the long-tail structure?  How many of the tracks are sitting at confidence level 1 right now, vs. 2 or 10 or 100?


AccurateRip - Future Direction

Reply #16
>DB1-OD field were computed with ARCS and not ARCS-F

They are calculated over ~544 samples, so the bug does not come into play; the offset-finding CRC is just 1 frame, about 50 in (off the top of my head).

>There are 16 M tracks in the DB and its size is 14 GB, so that tells me that there are an average of 5 entries per track

I would say that is right. The 2nd database I would keep separate from the first, so it would need the overhead.

>one rolling checksum that will cover different pressings, how much would that shrink the db?

It would not; the above design calls for each pressing to be stored individually. A match on a pressing not in the database is only a partial match (missing 10 frames).

AccurateRip - Future Direction

Reply #17
They are calculated over ~544 samples, so the bug does not come into play; the offset-finding CRC is just 1 frame, about 50 in (off the top of my head).

If the same code was used for DB1-OD, then the bug will show up.  Remember that every even-numbered sample shifts the MSB of the right channel out of the result.  This happens even inside a single frame, since you are multiplying by sample number.  A sample whose number is a multiple of a power of two has that many bits (the exponent) shifted out.  The sample number being used is the low-order 32 bits of the offset from the beginning of the file, not just the offset from the frame beginning.

DB1-OD's purpose is solely offset detection, so having at least complete coverage of the left channel data is probably good enough.
Quote
>There are 16 M tracks in the DB and its size is 14 GB, so that tells me that there are an average of 5 entries per track

I would say that is right.

Does this mean that there is one record for every submission (based on computer ID) for each track?  If a new submission comes in that's already there, does it just update the CRC fields for that ID?

AccurateRip - Future Direction

Reply #18
>DB1-OD's purpose is solely offset detection, so having at least complete coverage of the left channel data is probably good enough.

Exactly, even if only the left channel was used on its own, it would work.

Yes, each submission has a record; resubmissions are dropped, not replaced.

AccurateRip - Future Direction

Reply #19
Is there ANY way that we can avoid using the DISCID in the catalog?  The presence or absence of a data track (and its size) makes checking audio files for AccurateRip accuracy impossible with the current database structure when only the audio files are available.

AccurateRip - Future Direction

Reply #20
We used everything from the CD TOC to cut the number of collisions, including the data track on CD Extra.

AccurateRip - Future Direction

Reply #21
4. Fix & Additional Development: Use a different hash (MD5, SHA-1); these would increase the storage of the database by up to 5x (160 bits for SHA-1).

At least with MD5 we would be able to check them directly against MD5 sums already stored in FLAC and WavPack files.
With SHA-1, you would enforce a new standard for file identification that would benefit from hardware implementations that bring the computational cost of hashing down to zero (MD5 will fade away from hardware chips because of its cryptographic weakness). Since I'm always more inclined to look ahead instead of back, I vote for SHA-1.

AccurateRip - Future Direction

Reply #22
Quote
We used everything from the CD TOC to cut the number of collisions, including the data track on CD Extra.


That doesn't mean it's the right thing to do.  I understand the issues with collisions, but I think there are very few collisions based on the presence or absence of a data track in the TOC, and the usefulness of AR goes up significantly if you can test an arbitrary set of lossless audio data against the DB.  I think a few more collisions versus a significant improvement in functionality should be considered...

AccurateRip - Future Direction

Reply #23
Yes, each submission has a record; resubmissions are dropped, not replaced.

If I understand this correctly, you may be rejecting good submissions. From reading the posts, many people first rip in burst mode, and if an error shows up against the AR database, they then use secure mode. Assuming that the re-rip is good, the first submission is bad but would be accepted by the database. Then, when the good submission is submitted, it would be rejected. If this is correct, you are rejecting a substantial amount of good data.
Glass half full!

AccurateRip - Future Direction

Reply #24
How would the system know if it is a good submission? If the tracks are already in the database with a confidence of 2 or higher, adding 1 more does not add much in terms of value; maintaining 1 computer's 100x rips of a dodgy CD as they attempt to recover it is not good sense.

-----
@cerbus

If you have the album name and artist you can get a TOC from MusicBrainz; do it that way around.