Comparing two CD databases: need formats for cdinfo and/or regexes
Nov 20 2003, 16:53
Post #1 | Group: Members | Posts: 81 | Joined: 17-April 03 | Member No.: 6024
Hi all!

My challenge / problem is this:

1. I have freedb (and can parse it, with Perl, to output "artist<TAB>album"), around 140K entries, where each entry is a CD.
2. I have "another" database (also in the format "artist<TAB>album") with more than 10K entries.

I wish to find out how many of the entries in 1. can be found in 2.

Unfortunately, I have no experience in generating dynamic patterns, nor in using Spell / ISpell (the spelling checkers). So I am curious which Perl packages or other tools you would recommend as the approach most likely to succeed. Database 2. is currently so messy that I can only find around 5% of its entries in 1.

At this stage, I am grateful for any and all suggestions.

PS! Oh, and I use Linux, so I have access to all the tools available for that platform.

This post has been edited by DrDoogie: Nov 20 2003, 16:54
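One common way to approach the counting problem described above is to reduce every "artist<TAB>album" line to a crude comparison key and look the keys up in a hash. The following is only a minimal sketch under stated assumptions: the helper names (`normalize`, `normalize_artist`, `entry_key`, `count_matches`) and the sample data are made up for illustration, and the normalization rules (lower-casing, dropping everything except ASCII letters and digits, flipping "Last, First" artist names) are guesses at what might help, not a tested recipe for these two databases.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: lower-case a field and collapse everything that is
# not an ASCII letter or digit into single spaces. This is an assumption;
# it throws away accented characters, which may or may not be acceptable.
sub normalize {
    my ($s) = @_;
    $s = lc $s;
    $s =~ s/[^a-z0-9]+/ /g;
    $s =~ s/^ +| +$//g;    # trim leading and trailing spaces
    return $s;
}

# Hypothetical helper: artists additionally get "Last, First" flipped
# to "First Last" before normalization.
sub normalize_artist {
    my ($s) = @_;
    $s =~ s/^([^,]+),\s*(.+)$/$2 $1/;
    return normalize($s);
}

# Build the comparison key for one "artist<TAB>album" line.
sub entry_key {
    my ($line) = @_;
    my ($artist, $album) = split /\t/, $line, 2;
    $album = '' unless defined $album;
    return normalize_artist($artist) . "\t" . normalize($album);
}

# Count how many lines of @$small have a key that also occurs in @$big.
sub count_matches {
    my ($big, $small) = @_;
    my %seen;
    $seen{ entry_key($_) } = 1 for @$big;
    my $count = 0;
    for my $line (@$small) {
        $count++ if $seen{ entry_key($line) };
    }
    return $count;
}

# Tiny in-memory stand-ins for the two files (made-up sample data).
my @freedb = ("Mike Oldfield\tTubular Bells", "Pink Floyd\tThe Wall");
my @other  = ("Oldfield, Mike\ttubular bells!", "Unknown Artist\tUntitled");

print count_matches(\@freedb, \@other), " of ", scalar(@other),
      " entries matched\n";
```

Building the hash from the larger list costs one pass, and each lookup is then constant time, so 140K against 10K entries should not be a problem. Exact key matching will still miss misspellings; those would need a fuzzy comparison on top of this.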
Nov 21 2003, 10:08
Post #2 | Group: Members | Posts: 189 | Joined: 9-July 02 | Member No.: 2536
Wouldn't it make sense to put all the entries into a single database with MySQL or something similar? Then you could just use a simple SQL query to find duplicates or even clean up the entire table.
Nov 21 2003, 23:29
Post #3 | Group: Members | Posts: 81 | Joined: 17-April 03 | Member No.: 6024
QUOTE (Jasper @ Nov 21 2003, 01:08 AM)
Wouldn't it make sense to put all the entries into a single database with MySQL or something similar? Then you could just use a simple SQL query to find duplicates or even clean up the entire table.

Mmm, I suppose I could use some "case-insensitive 'like'" stuff in MySQL, but why? Perhaps you don't know what a regular expression is.

Say that you have the name of an artist in two formats:

A. "Mike Oldfield"
B. "Oldfield, Mike"

In order to match these two, you need a regular expression. Say for instance with this:

CODE
    s/([^,]*),\s(.*)/$2 $1/

Also, for various erroneous entries in the album title, I have currently come up with some other patterns, which I read from a file as:

CODE
    while (<album_patterns>) {
        chomp;
        # skip blank lines and comment lines
        if (!(/^$/ || /^#/)) {
            my ($pattern, $replacement, $modifier) = split /\t/;
            # strip the surrounding single quotes from each field
            $pattern     =~ s/^'(.*)'$/$1/;
            $replacement =~ s/^'(.*)'$/$1/;
            $modifier    =~ s/^'(.*)'$/$1/ if defined $modifier;
            $albumPatterns{$pattern} = $replacement;
        }
    }

These are the patterns, though I should note that they are not finished yet. Also, the Unicode setup on my box is f'ed, so I have to devise the patterns in a somewhat clumsy way:

CODE
    # year
    #'(\D('[1-9]\d|[1-9]\d{3}))' '[YEAR: $1]'
    # yearspan
    #'(\D('[1-9]\d|[1-9]\d{3}))(\s*.?\s*)(('[1-9]\d|[1-9]\d{3})\D?)' '[YEARSPAN: $2$3$5]'
    # volumenumber
    #'[Vv]ol(ume|\.)[\W\s]?(\d*|[a-zA-Z]*)' '[VOLUMENUMBER $2]'
    # volumespan
    #'[Vv]ol(\.|ume)?s?[\W\s]+(\w+)(.*[Vv]ol(\.ume)?s?)?(\W+(\w+))' '[VOLUMESPAN: $2_$6]'

This post has been edited by DrDoogie: Nov 21 2003, 23:30
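One thing to watch when patterns and replacements are read from a file as plain strings: in `s/$pattern/$replacement/`, a `$1` or `$2` stored inside `$replacement` is inserted literally rather than being filled in from the match. The sketch below shows one possible way around that by expanding the `$N` placeholders by hand. The function name `apply_pattern` and the `@rules` list are assumptions made for illustration and do not appear in the posts above; the two rules are taken from the patterns shown in this post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply one pattern/replacement pair, both given as
# plain strings, to every match in $text. Placeholders such as $1 and $2
# in the replacement string are filled in from the capture groups.
sub apply_pattern {
    my ($text, $pattern, $replacement) = @_;
    my $re  = qr/$pattern/;
    my $out = '';
    my $pos = 0;
    while ($text =~ /$re/g) {
        # Save the match offsets and captures now, because the inner
        # substitution below overwrites @- and @+.
        my ($start, $end) = ($-[0], $+[0]);
        my @caps = map {
            defined $-[$_] ? substr($text, $-[$_], $+[$_] - $-[$_]) : ''
        } 0 .. $#+;    # index 0 is the whole match, 1 is $1, and so on

        # Fill in $N placeholders; unknown group numbers become ''.
        (my $filled = $replacement) =~
            s/\$(\d+)/defined $caps[$1] ? $caps[$1] : ''/ge;

        $out .= substr($text, $pos, $start - $pos) . $filled;
        $pos = $end;
    }
    return $out . substr($text, $pos);
}

# An ordered list of [pattern, replacement] pairs (assumed layout).
my @rules = (
    [ '([^,]*),\s(.*)',                         '$2 $1' ],
    [ '[Vv]ol(ume|\.)[\W\s]?(\d*|[a-zA-Z]*)',   '[VOLUMENUMBER $2]' ],
);

print apply_pattern("Oldfield, Mike", @{ $rules[0] }), "\n";
print apply_pattern("Greatest Hits Vol. 2", @{ $rules[1] }), "\n";
```

An array of pairs is used here instead of a hash because Perl does not guarantee the iteration order of hash keys, and the order in which rewrite rules are applied can matter.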