Fuzzy String matches in TSQL

This post talks about a new preview feature being introduced within MS SQL SERVER database engine. Readers are advised to check out the full details in the link below.

Those working in business intelligence solutions often face the issue of cleaning up incorrect data coming from multiple sources. Sometimes this data isn’t even incorrect. It is just a variation of the same word as spelt in a different region. We are more familiar with this when we deal with dates, for example, in some countries, the date format is MMDDYY whereas in other countries it is DDMMYY, etc. But when it comes to words things get a bit more complex.

The word defence is also spelt as defense. Naturally they mean the same thing but it’s not the same when we have to build reports out of this value. Previously, we would use data quality services, SSIS fuzzy lookup ,Soundex  etc. identify/fix such issues. However, this works primarily in cases for known values that are part of dimension attributes.

To alleviate this issue for more broader cases such as typos or variations of words in product descriptions etc. we can now leverage the DISTANSE set of function inside TSQL. These include

FunctionDescription
EDIT_DISTANCEThe number of edits it will take to make the words match e.g. edit distance for the word space and paces is 2
EDIT_DISTANCE_SIMILARITYA fuzzy lookup score ranging from 1 to 100. useful for cases where the lengths might vary significantly.
JARO_WINKLER_DISTANCEA score based on how similar the starting characters for both strings are.
JARO_WINKLER_SIMILARITYA score range from 0 to 1

Edit_Distance and Edit_Distance_Similarity

The below screenshot shows and example of the usage for EDIT_DISTANCE and EDIT_DISTANCE_SIMILARITY, the similarity should be thought of as a fuzzy lookup score ranging from 1 to 100.

declare @word1 varchar(100) ='space' 
declare @word2 varchar(100) ='paces'

;with cte as (
select  @word1  as W1 , @word2  as W2
)

select  *, EDIT_DISTANCE(W1, W2), EDIT_DISTANCE_SIMILARITY(W1,W2) from cte 

An interesting use case of this function is to differentiate between words that are spelt the same but capitalized. such as the name Green vs the color.

declare @word1 varchar(100) ='green' 
declare @word2 varchar(100) ='Green'

;with cte as (
select  @word1  as W1 , @word2  as W2
)

select  *, EDIT_DISTANCE(W1, W2), EDIT_DISTANCE_SIMILARITY(W1,W2) from cte 

JARO_WINKLER_DISTANCE and JARO_WINKLER_SIMILARITY

This is a different algorithm to check similarity of words. In the main reason you would want to use it is for words that differ towards the end. For example the words throw and throwing will return poor scores if you use Edit_Distance function but would return a high score using JARO winkler.

declare @word1 varchar(100) ='throws' 
declare @word2 varchar(100) ='throwing'

;with cte as (
select  @word1  as W1 , @word2  as W2
)

select  *, 
EDIT_DISTANCE(W1, W2) as'dist_score', 
EDIT_DISTANCE_SIMILARITY(W1,W2)  as 'dist_sim',
JARO_WINKLER_DISTANCE(w1,w2) as 'jdist_score' ,
JARO_WINKLER_SIMILARITY(W1,W2) as 'jdist_sim'
from cte 

A place I can immediately see a use for this new set of functions is in cleaning up company name like ‘Abcd LLC’ and ‘Abcd’ etc.

Reference

https://learn.microsoft.com/en-us/sql/relational-databases/fuzzy-string-match/overview?view=azuresqldb-current