[Israel.pm] approximate match

Yuval Kogman nothingmuch at woobling.org
Wed Dec 23 04:44:22 PST 2009


BTW, none of these are indexed solutions AFAICT, so they are all O(N)
over the possible matches at lookup time.

If you need something with better lookup times, the standard solution
is to do n-gram (by character) of the inputs, and index those. You can
also index substrings.

At lookup time you do the same thing to the input string.

The number of matches and the size of n compared to the length of the
input and candidate strings can give you a lower bound on the edit
distance.

For example:

"the quick brown fox"
"the quick brown box"

the edit distance is 1 here

if we choose an n of 5, the sequences will be "the q", "he qu" "e qui"
" quic" "quick" etc.

If we match at least one we know the edit distance is at most
length(input) (19) - n (5) = 14.

Otoh, scanning:

"i like cheese"

is probably meaningless, since they aren't similar at all.

Once you take more matches into account it's not as clear cut,
obviously, but hopefully if the database is huge then you at least
have a much smaller candidate list, and you can further sort it by a
number of things (i.e. prioritize substrings with fewer candidate
matches, prioritize candidates which match more substrings, etc),
before you compute the actual edit distance.

The target edit distance lets you choose a threshold, and the higher
it is the longer the substrings can be and the more you can prune from
the search space.

Another similar but simpler approach is to use the nilsimsa hash and
index substrings for collisions.

This doesn't require n-grams because the data is always aligned.

Here are the nilsimsa hashes for the above strings, and another string
with an offset change:

"the quick brown fox"
0a31b4be01a0808a29e0ec60e9e258545dc05267700223082a0a2128708b2edb

"the quick brown box"
0a31b4bf05a08088a9e0ed60e9c858505dc05266700043182a080128688b2edb

"a quick brown box"
181034bf05208088a1a4ed00e9c850105dc04266700063182a080128688b2edb

"i like cheese"
1aa468101a210c0408b48130006010140047640443804a028042e01c8000402a

Then you can split the hashes up into substrings, for instance:

The first and third hashes split up to 32 bit substrings have 3 exact matches:

0a31b4bf 05a08088 a9e0ed60 e9c85850 5dc05266 70004318 2a080128 688b2edb
181034bf 05208088 a1a4ed00 e9c85010 5dc04266 70006318 2a080128 688b2edb

Wheras "i like cheese" has no common substrings with either

1aa46810 1a210c04 08b48130 00601014 00476404 43804a02 8042e01c 8000402a

A trie would be a good fit for such an index, because even partial
matches are meaningful, and the data is more compact.

Again, depending on the length of the substrings and the number of
exactly matching substrings you have some level of confidence about
the overall similarity of the inputs.


More information about the Perl mailing list