tero.co.uk

String matching functions

Below is a fuzzy string matching function for MySQL. It's a bit like Levensthein or Soundex. It basically converts a short string (of about 10-20 letters) into a number which can then be very quickly compared (using XOR) with the numbers for other strings to see how closely they match. (Note that this only compares from the beginning of shortish string.) This conversion can be done whenever the string is updated, and stored in a different database column, so that the string matching will be very quick (quicker than Levensthein at least).

The core function is XORC, which does the conversion. It divides the string into syllables and then finds the least common letter in each syllable and stores their values in an integer. The 32 bit integer contains the values of 6 such letters. There is also XORM for seeing if strings match and XORP for getting a percentage from the match. So for instance it turns the word "Jamaica Hotel" into a number representing the letters "JCHL" and then compares that with other strings. In this way it can overcome typos and other such errors like "Jamaca Hotel".

To try the functions you must first get the C source code and then compile the functions and import them into MySQL. Instructions are provided in the source code.

It is used like this:

SELECT xorc('Air Jamaica');
SELECT xorp('Air Jamaica', 'Air France');