Full Version: Collecting ideas for a free, perfect hashing tool
sn0wman
Please submit ideas for features that you can't find implemented in any existing hashing tool, or that are so important that you would not use a program without them.

Many thanks, sn0wman.
RedFox
There are already nice hashing and checksum tools, e.g. fsum or par2 (which includes recovery), but what I miss most is the ability to calculate a hash of audio files that applies only to the audio part.
I.e., I store and update tag values in the files, so any hash calculated over the whole file no longer matches after I change the value of a tag in that file.
Some lossless formats include verification (e.g. FLAC), but IIRC, MP3 doesn't.
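
A minimal sketch of the "hash only the audio part" idea for MP3, assuming the only tags present are an ID3v2 tag at the start and an ID3v1 tag at the end. The names `strip_id3` and `audio_md5` are hypothetical, and other tag types (APEv2, Lyrics3, etc.) are not handled, so this is an illustration rather than a complete tool:

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag and a trailing ID3v1 tag."""
    start, end = 0, len(data)
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4-byte syncsafe size.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)  # 7 useful bits per byte
        start = 10 + size
        if data[5] & 0x10:  # "footer present" flag: 10 more bytes
            start += 10
    # ID3v1 tag: the last 128 bytes of the file, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]


def audio_md5(data: bytes) -> str:
    """MD5 of everything that is left once the ID3 tags are removed."""
    return hashlib.md5(strip_id3(data)).hexdigest()
```

With this, `audio_md5(open("mysong.mp3", "rb").read())` should stay the same when only ID3 tag values change, while a plain MD5 of the whole file would not.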
sn0wman
That's the main feature of my OSS application, be patient :).
I want to do as much as possible before releasing it, which is why I posted this topic. Now I am asking myself: is it only me who wants so much from a hashing utility? Many algorithms implemented, Unicode, regular expressions for searching the files, shell integration including its own context hashing 'profiles', cumulative folder content logging, etc.? I hope other people will find it useful, to say nothing of the audio features mentioned above.
zima
I wonder... how far will you take shell integration? A few steps I can imagine:
1) A "check integrity of file" option in the right-click menu
2) A field in the right-click menu that automatically shows whether the file is correct when right-clicking (possible?)
3) Something in the tray that automatically checks files (could be limited/predetermined to, for example, only files from removable media) and marks their icons "yeah, this one's ok" (possible?)
legg
An online hash database. Even though this is risky, it might help those who rip extremely damaged CDs.
kwanbis
Isn't that AccurateRip?
legg
QUOTE (kwanbis @ Apr 10 2006, 08:51 PM) *
Isn't that AccurateRip?


Dunno, the idea just crossed my mind. But nevertheless it might be a good feature for his tool.
sn0wman
QUOTE (kwanbis @ Apr 11 2006, 04:51 AM) *
Isn't that AccurateRip?


Doesn't AccurateRip base itself on a WAV's checksum?
So using it (?) for ordinary hashing is useless.
However, the application is also able to calculate fingerprints of lossless files, so maybe this would be the place for it, nice :).

Thanks for all suggestions, others are welcome.
emtee
1) Integrity check of md5, sfv, par, par2 files.
2) Directory recursive.
3) Multiplatform.
4) GUI-based.

These would be awesome :)
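
Points 1) and 2) can be sketched together. The example below assumes the commonly used SFV line layout (file name, a space, then the CRC32 as 8 hex digits, with `;` starting a comment line); the function names `file_crc32`, `write_sfv` and `verify_sfv` are made up for illustration and do not come from any existing tool:

```python
import os
import zlib


def file_crc32(path, chunk_size=1 << 16):
    """CRC32 of a file, read in chunks so large files are fine."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def write_sfv(root, sfv_path):
    """Walk root recursively and write one 'relative/path CRC32' line per file."""
    sfv_abs = os.path.abspath(sfv_path)
    lines = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            full = os.path.join(dirpath, name)
            if os.path.abspath(full) == sfv_abs:
                continue  # do not checksum the .sfv file itself
            rel = os.path.relpath(full, root)
            lines.append(f"{rel} {file_crc32(full):08X}")
    with open(sfv_path, "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")


def verify_sfv(root, sfv_path):
    """Return the relative paths that are missing or whose CRC32 differs."""
    bad = []
    with open(sfv_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if not line.strip() or line.startswith(";"):
                continue  # skip blank and comment lines
            rel, expected = line.rsplit(" ", 1)
            full = os.path.join(root, rel)
            if not os.path.isfile(full) or file_crc32(full) != int(expected, 16):
                bad.append(rel)
    return bad
```

Since it uses only the Python standard library, the same sketch should run on any platform, and a GUI could be layered on top of these functions.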
krmathis
What emtee mentioned, in addition to the 'hash of audio files that applies only to the audio part' which RedFox suggested.
A command-line application is fine for me, since I always have Terminal open anyway.
PiezoTransducer
How about having the resulting hash be simple enough that it can be included as a field of the metadata of the file itself?

I'm not sure what I mean by "simple"... I guess I'm leaving it up to the reader to decide.

It'd be nice if this is something that would eventually find native support in all major music players.

The hash should also ideally be immune to changes to the audio stream that don't affect the decoded output. For example, audio that's been padded with digital silence should have the same hash as the same audio without the padding.
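
One way to read the silence-padding requirement is sketched below, assuming the hash is taken over decoded, signed PCM where digital silence is all-zero sample frames. The function names and the default of 4 bytes per frame (16-bit stereo) are assumptions for the example only:

```python
import hashlib


def strip_digital_silence(pcm: bytes, frame_bytes: int = 4) -> bytes:
    """Drop all-zero frames from both ends of raw signed PCM data.

    Any incomplete frame at the very end of the buffer is discarded.
    """
    zero = bytes(frame_bytes)
    total = len(pcm) // frame_bytes
    first = 0
    while first < total and pcm[first * frame_bytes:(first + 1) * frame_bytes] == zero:
        first += 1
    last = total
    while last > first and pcm[(last - 1) * frame_bytes:last * frame_bytes] == zero:
        last -= 1
    return pcm[first * frame_bytes:last * frame_bytes]


def padding_invariant_md5(pcm: bytes, frame_bytes: int = 4) -> str:
    """MD5 of the PCM data with leading and trailing silence removed."""
    return hashlib.md5(strip_digital_silence(pcm, frame_bytes)).hexdigest()
```

Silence in the middle of a track is left alone, so only padding at the start or end is ignored.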
Triza
QUOTE (emtee @ Apr 14 2006, 11:42 AM) *
1) Integrity check of md5, sfv, par, par2 files.
2) Directory recursive.
3) Multiplatform.
4) GUI-based.

These would be awesome :)


Actually, no.

New 4) First we need a command-line-based tool. Then someone can create a wrapper on top of that.

5) Open source
6) Cross-platform

Otherwise I won't be able to use it.

Triza
p0l1m0rph1c
QUOTE (PiezoTransducer @ Apr 15 2006, 06:35 AM) *
How about having the resulting hash be simple enough that it can be included as a field of the metadata of the file itself?

I'm not sure what I mean by "simple"... I guess I'm leaving it up to the reader to decide.

It'd be nice if this is something that would eventually find native support in all major music players.

The hash should also ideally be immune to changes to the audio stream that don't affect the decoded output. For example, audio that's been padded with digital silence should have the same hash as the same audio without the padding.


Heh, what can be simpler than 52E2B834 (CRC32) or D75909AF25EF3788957459263AD0D74D (MD5)?
Easily fits into any type of tag.
norz
QUOTE (sn0wman @ Apr 11 2006, 12:04 AM) *
That's the main feature of my OSS application, be patient :).

Maybe you could base the audio decoding part on existing plugins (eg: 1by1 player uses winamp plugins)
Or maybe, given that it's OSS, it's better to include existing libraries?
Just a thought, I'm not a developer ;)
pepoluan
While we're on the topic of hashes...

Why must audio files be hashed using MD5? It's too complicated, I think. I mean, MD5's main purpose is not error checking but rather detecting willful tampering.

For the normal damage that happens to audio files, CRC32 is enough. Perhaps two CRC32 values with different polynomials. That should be quite robust. And it's easier to implement. Not to mention a whole lot faster.
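
A rough sketch of the "two CRC32 values with different polynomials" idea. The choice of the second polynomial (the Castagnoli one used by CRC-32C) is only an assumption for the example; the post does not name one. The bitwise loop is written for clarity, not speed:

```python
def crc32_reflected(data: bytes, poly: int) -> int:
    """Bitwise reflected CRC-32 with initial value and final XOR of 0xFFFFFFFF."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


IEEE_POLY = 0xEDB88320        # the usual CRC32 (zip, SFV), reflected form
CASTAGNOLI_POLY = 0x82F63B78  # CRC-32C, reflected form


def double_crc(data: bytes) -> str:
    """Two independent 32-bit checks, printed as one 16-hex-digit string."""
    a = crc32_reflected(data, IEEE_POLY)
    b = crc32_reflected(data, CASTAGNOLI_POLY)
    return f"{a:08X}{b:08X}"
```

For the standard test string `b"123456789"` this gives `CBF43926E3069283`, i.e. the well-known check values of the two CRCs side by side.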
sn0wman
First of all, I am not collecting ideas for something I am going to start, but for something I started about a year ago, so the work is at a very advanced stage. That implies:
    - the application won't be (very) cross-platform, however I will try to make its engine (there is one) cross-platform;
    - the application is GUI-based, but command-line parameter passing is on the TODO list, as is a standalone
    command-line version, which may (?) be cross-platform;
    - the application already features MD5, CRC16 & CRC32 and many others;
And now:
    - I like legg's idea of making use of the AccurateRip database, online and offline; par/par2 file checking/creating is also a new idea for me.
    - I like zima's idea about the tray icon. I just like it, I'm not saying I will do it :)!
    - I don't like zima's idea of showing the result in the context menu. It sounds very interesting to me too, but we can't forget that showing the menu is supposed to be an instant action; we can't make the system context menu wait (for hashing!);
    - the application will store the audio hash in a tag (already on TODO);
QUOTE
Maybe you could base the audio decoding part on existing plugins (eg: 1by1 player uses winamp plugins)

What do you mean by that? An audio hash doesn't need encoding, a fingerprint does.
PiezoTransducer
QUOTE (sn0wman @ Apr 18 2006, 03:18 PM) *
- the application will store the audio hash in a tag (already on TODO);

Just an elaboration of my vague comment a couple of posts above. It'd be nice if there were a standard hash (fingerprint?) tag for the audio, just like there is a ReplayGain value for loudness. I may be thinking along different lines from your original intention. I'm thinking on the level of making like... a new RFC, while I think you're talking about just an application.
p0l1m0rph1c
QUOTE (pepoluan @ Apr 19 2006, 03:19 AM) *
While we're on the topic of hashes...

Why must audio files be hashed using MD5? It's too complicated, I think. I mean, MD5's main purpose is not error checking but rather detecting willful tampering.

For the normal damage that happens to audio files, CRC32 is enough. Perhaps two CRC32 values with different polynomials. That should be quite robust. And it's easier to implement. Not to mention a whole lot faster.


Well, no one is forcing anyone to use MD5 for hashing. And well, MD5 is still the hash algorithm which attains the best speed/security ratio. You could use MD4, which is faster but is known to be flawed. Yeah, you could use CRC32, but then there is a probability of about 1 in 4 billion (2^-32) that an error will not be detected.

It's a long shot, but why risk it when you can use MD5 (or whatever, like SHA-1 or <insert hash algo here>)? The speed is not that much better. You can probably go 2x faster with CRC32 than with MD5. Maybe a little more. Either way, your speed is bounded by hard drive speed, not by the algorithm.
SebastianG
QUOTE (p0l1m0rph1c @ Apr 22 2006, 02:29 AM) *
Well, no one is forcing anyone to use MD5 for hashing. And well, MD5 is still the hash algorithm which attains the best speed/security ratio. You could use MD4, which is faster but is known to be flawed.

So is MD5, IIRC (flawed in terms of security against an intelligent attacker who intentionally wants to create collisions). But if you just want to protect files against "random corruption", CRC32 is fine, too.

However, if you also plan to use the "hash" as some kind of key in a database, it had better be large (160 bits or more). Note that the probability of a collision with 2^X randomly generated codes of 2X bits in length is around 50%.
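
The database-key remark can be checked with the usual birthday approximation, p ≈ 1 - exp(-n(n-1) / (2·2^bits)). The helper below is a hypothetical illustration that assumes the codes are uniformly random:

```python
import math


def collision_probability(num_codes: int, bits: int) -> float:
    """Approximate chance that at least two of num_codes random codes collide."""
    exponent = -num_codes * (num_codes - 1) / (2 * 2 ** bits)
    return -math.expm1(exponent)  # equals 1 - exp(exponent), accurate for tiny values


# 2^X codes of 2X bits: X = 16 gives 65536 codes of 32 bits (CRC32-sized)
p_crc32_sized = collision_probability(2 ** 16, 32)
# The same number of codes with a 160-bit hash
p_160_bit = collision_probability(2 ** 16, 160)
```

Under this approximation, 2^X codes of 2X bits collide with probability close to 1 - e^(-1/2), i.e. nearer 39% than 50%, which is still the same ballpark. With a 160-bit code and the same 65536 entries, the probability is vanishingly small.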

Sebi
p0l1m0rph1c
Well, yeah. So is SHA-1 (conceptually; not everyone will bother to do 2^63 iterations, heh). My point there was speed. The advantages of MD5 for uses other than checksumming (you mentioned databases as an example) outweigh the not-too-large speed penalty compared with, say, CRC32.
rjamorim
QUOTE (p0l1m0rph1c @ Apr 21 2006, 09:29 PM) *
Yeah, you could use CRC32, but then there is a probability of about 1 in 4 billion (2^-32) that an error will not be detected.


Since nobody is trying to detect intentional tampering here (why would someone bother to tamper with the signatures of your music collection? Insert subliminal messages?), I don't see the point of going with full-blown MD5, SHA or Whirlpool. If the case against CRC32 is avoiding a collision once every 4 billion times, let's go with CRC64, or CRC256, or CRC65536 if you're really insane :P

Also, CRC gives you an opportunity to implement some error correction if your stream has few errors ("Correction can also be done if information lost is lower than information held by the checksum"). Cryptographic hashes throw that opportunity out of the window.
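
A small sketch of the error-correction idea, assuming at most one flipped bit and a short buffer. It simply tries every single-bit flip until the stored CRC32 matches again; the function name is hypothetical, and a real implementation would more likely use a syndrome table than this brute-force loop:

```python
import zlib


def correct_single_bit(data: bytes, expected_crc: int):
    """Return data with one flipped bit repaired, or None if that is not enough."""
    if zlib.crc32(data) == expected_crc:
        return data  # nothing to fix
    buf = bytearray(data)
    for i in range(len(buf) * 8):
        byte, bit = divmod(i, 8)
        buf[byte] ^= 1 << bit      # try flipping this bit
        if zlib.crc32(buf) == expected_crc:
            return bytes(buf)
        buf[byte] ^= 1 << bit      # undo and try the next bit
    return None
```

The cost grows with the square of the buffer length, so this only makes sense per frame or per small block, not per file.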
norz
QUOTE (sn0wman @ Apr 18 2006, 23:18) *
QUOTE

Maybe you could base the audio decoding part on existing plugins (eg: 1by1 player uses winamp plugins)

What do you mean by that? An audio hash doesn't need encoding, a fingerprint does.

My mistake: I thought you'd have to decode the audio to hash it (hence the idea of using existing plugins), but I guess you'll just take the audio bits and hash them without any prior processing.

edit: spelling
norz
@sn0wman: Any news on your project?
norz
QUOTE (norz @ Jul 25 2006, 21:27) *
@sn0wman: Any news on your project?

A workaround solution until sn0wman's program is released:
Use a decoder and a hashing program that supports pipes.

Example (on Windows):
madplay.exe --output=wave:- "mysong.mp3" | md5sum
This will send a 16-bit PCM WAV stream to md5sum.
md5sum is a port of the GNU utilities, from here I think.

I have tested this by replacing some characters in the tags with foobar.
Original and modified files:
- have same size
- have different md5 checksums
- produce decoded wave streams that have the same checksum

---edit begin:
I'm using madplay 0.15.2 (beta).

Regarding tags: my foobar2000 writes ID3v1 and APEv2 tags to the MP3, and madplay doesn't like this: on those files it will display an error message saying: "error: frame 999: lost synchronization", where 999 is the last decoded frame. However, the MD5 checksum of the decoded stream is the same for an MP3 file without APEv2 tags and for the same file after foobar2000 has applied an APEv2 tag to it.

I've changed my command line a bit:
madplay.exe --output=wave:- --verbose --display-time=remaining %1 | md5sum > %1.md5
This will display the remaining time (on the terminal) as it processes the file,
and write a .md5 file automatically, which makes it better suited to being called by a batch script to produce .md5 checksums (e.g. with sweep)
---edit end
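
The same pipe trick can be sketched in Python instead of a shell pipeline. The helper name `hash_command_output` is made up for this example; it runs any decoder command that writes to standard output and hashes whatever comes out, so it assumes the decoder (e.g. madplay with the `--output=wave:-` option shown above) is installed and on the PATH:

```python
import hashlib
import subprocess


def hash_command_output(cmd, algorithm="md5", chunk_size=1 << 16):
    """Run cmd, hash its standard output, return (hex digest, exit code)."""
    digest = hashlib.new(algorithm)
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        # Read the decoded stream in chunks so long tracks do not fill memory.
        for chunk in iter(lambda: proc.stdout.read(chunk_size), b""):
            digest.update(chunk)
    # Leaving the with-block waits for the process, so returncode is set here.
    return digest.hexdigest(), proc.returncode


# Example call (needs madplay to be installed):
# hexdigest, code = hash_command_output(["madplay", "--output=wave:-", "mysong.mp3"])
```

The exit code is returned rather than treated as an error, since the post above notes that the decoder can complain about trailing tags while still producing the same decoded stream.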
sn0wman
I am not dead, just on holiday now :)
See you soon with some news, OK? (A little basic cmd alpha testing version? [testing the tag-independent engine])

best regards, sn0wman
sn0wman
Huh, maybe it is not the 'basic testing cmd version' mentioned above, but it's still something, isn't it?
screen1
screen2
screen3
bhoar
QUOTE (sn0wman @ Sep 8 2006, 18:33) *
Huh, maybe it is not the 'basic testing cmd version' mentioned above, but it's still something, isn't it?
screen1
screen2
screen3


Sure is... when does open testing begin? :)

-brendan
ionication
What a pity that sn0wman hasn't released his project yet. In the meantime, here's another hint for all of you who are searching like me:

You can use mp3tag (www.mp3tag.de) and create an individual export function using %_md5audio% to calculate an MD5 hash of the audio part of the files only. There should also be a way to get this hash into a tag field, if desired, using mp3tag. The program is extremely flexible. I am using these audio hashes to check whether two files with different tags and filenames contain the same audio (duplicate search).
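
The duplicate search itself is simple once the audio hashes exist. A minimal sketch, assuming the hashes have already been exported into a mapping of file name to hex string (the function name and the input shape are assumptions, not part of mp3tag):

```python
from collections import defaultdict


def group_duplicates(audio_hashes):
    """Group file names by audio hash and keep only hashes shared by 2+ files."""
    groups = defaultdict(list)
    for name, digest in audio_hashes.items():
        groups[digest.lower()].append(name)  # ignore upper/lower case in hex
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Each returned list holds files whose audio part hashed to the same value, regardless of their tags or file names.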
...Just Elliott
My personal favourite algorithm is SHA-512. It will never run out in the history of the world, collisions are unlikely, and it's damn fast.
sn0wman
In the meantime, please tell me which MD5 (SFV, etc.) programs you mostly use. I need to know their log formats.
Synthetic Soul
FSUM (command line) and QuickSFV (Explorer context menu).
Jojo
QUOTE (RedFox @ Apr 10 2006, 12:06) *
the ability to calculate a hash of audio files that applies only to the audio part.

I second that. mp3Tag (http://www.mp3tag.de/en/) is able to do that. Also, maybe you could base your tool on md5summer (http://www.md5summer.org/). It's a small program I like, but it hasn't been updated for quite some time.
sn0wman
QUOTE (Jojo @ Mar 9 2007, 20:39) *
It's a small program


You are so right. I have looked at md5summer, and it is small indeed. And that's why audiohash is taking me so long, because it is not small. I keep on working. Believe me, the final result should be worth the wait, except for those who are waiting for a Linux version ;(.

Still waiting for some exotic log formats!
Polarix
Sad to see this never get released. That's why things are released incrementally, so when you finally call it quits the work isn't lost.

I kind of expected this from the start. Would've been nice.
Preuss
How far along are you with your utility?