This post talks about a new preview feature being introduced within MS SQL SERVER database engine. Readers are advised to check out the full details in the link below.
Those working in business intelligence solutions often face the issue of cleaning up incorrect data coming from multiple sources. Sometimes this data isn’t even incorrect. It is just a variation of the same word as spelt in a different region. We are more familiar with this when we deal with dates, for example, in some countries, the date format is MMDDYY whereas in other countries it is DDMMYY, etc. But when it comes to words things get a bit more complex.
The word defence is also spelt as defense. Naturally they mean the same thing but it’s not the same when we have to build reports out of this value. Previously, we would use data quality services, SSIS fuzzy lookup ,Soundex etc. identify/fix such issues. However, this works primarily in cases for known values that are part of dimension attributes.
To alleviate this issue for more broader cases such as typos or variations of words in product descriptions etc. we can now leverage the DISTANSE set of function inside TSQL. These include
Function | Description |
EDIT_DISTANCE | The number of edits it will take to make the words match e.g. edit distance for the word space and paces is 2 |
EDIT_DISTANCE_SIMILARITY | A fuzzy lookup score ranging from 1 to 100. useful for cases where the lengths might vary significantly. |
JARO_WINKLER_DISTANCE | A score based on how similar the starting characters for both strings are. |
JARO_WINKLER_SIMILARITY | A score range from 0 to 1 |
Edit_Distance and Edit_Distance_Similarity
The below screenshot shows and example of the usage for EDIT_DISTANCE and EDIT_DISTANCE_SIMILARITY, the similarity should be thought of as a fuzzy lookup score ranging from 1 to 100.
declare @word1 varchar(100) ='space'
declare @word2 varchar(100) ='paces'
;with cte as (
select @word1 as W1 , @word2 as W2
)
select *, EDIT_DISTANCE(W1, W2), EDIT_DISTANCE_SIMILARITY(W1,W2) from cte

An interesting use case of this function is to differentiate between words that are spelt the same but capitalized. such as the name Green vs the color.
declare @word1 varchar(100) ='green'
declare @word2 varchar(100) ='Green'
;with cte as (
select @word1 as W1 , @word2 as W2
)
select *, EDIT_DISTANCE(W1, W2), EDIT_DISTANCE_SIMILARITY(W1,W2) from cte
JARO_WINKLER_DISTANCE and JARO_WINKLER_SIMILARITY
This is a different algorithm to check similarity of words. In the main reason you would want to use it is for words that differ towards the end. For example the words throw and throwing will return poor scores if you use Edit_Distance function but would return a high score using JARO winkler.
declare @word1 varchar(100) ='throws'
declare @word2 varchar(100) ='throwing'
;with cte as (
select @word1 as W1 , @word2 as W2
)
select *,
EDIT_DISTANCE(W1, W2) as'dist_score',
EDIT_DISTANCE_SIMILARITY(W1,W2) as 'dist_sim',
JARO_WINKLER_DISTANCE(w1,w2) as 'jdist_score' ,
JARO_WINKLER_SIMILARITY(W1,W2) as 'jdist_sim'
from cte

A place I can immediately see a use for this new set of functions is in cleaning up company name like ‘Abcd LLC’ and ‘Abcd’ etc.