- > Company
- > Company Blog
- > Blog Detail
Correcting Human Error Using a Comparison Algorithm
23.08.2010 15:53 ( 0 comments )by Gary Chisholm
It's been a busy few weeks so I'll try to keep this short and sweet and share with you an introduction to a basic comparison algorithm; probably one of the better known to programmers is the Levenshtein Distance Algorithm which will be the focus of this blog.
Firstly a little definition over the differences between a comparison algorithm and a matching algorithm. A matching algorithm would quite simply be "does 'a' equal 'b'", whilst a comparison algorithm might be "is 'a' similar to, or could possibly be, 'b'". So now that the definitions are clarified let's delve into the idea.
So why is a comparison useful?
Simple, humans make mistakes and computers are stupid. If I were to type "potatot" into a search then you or I could quite easily see I meant to have typed 'potato' considering there is no word in the English language called "potatot".
The Levenshtein distance algorithm solves this problem by using what is called an "edit distance" to see how similar/dissimilar two words are based on the characters in those words. An edit distance is the number of changes needed to be made to convert one word into another. The edit actions for each character are add, remove and change and for each action a cost of '1' point is added to the total. Let's explore:
The algorithm requires a dictionary of words, for example our dictionary contains the words:
- cat
- potato
- onion
- tomato
We type our beloved word 'potatot' and compare it against each one.
p o t a t o t
c a t
1 1 0 1 1 1 1 edit distance = 6 points
With the cat example the 'p' and 'o' had to be edited to 'c' and 'a' so we add a point for each. We also have to add four new characters to the end so we add a point for each one of those as well. So our edit distance between potato and cat is 6.
p o t a t o t
p o t a t o
0 0 0 0 0 0 1 edit distance = 1 point
p o t a t o t
o n i o n
1 1 1 1 1 1 1 edit distance = 7 points
p o t a t o t
t o m a t o
1 0 1 0 0 0 1 edit distance = 3 points
So based on these examples we can see that the least number of edits to change "potatot" to a valid word is 1, thus our comparison has found "potato" to be the closest word, followed by "tomato" with "onion" in last... sorry onion.
So how can this help?
Like I said before people make mistakes and computers are stupid. It can help computers account for human error and make suggestions for what the user could have meant based on edit distance.
Obviously this method isn't any Google or office spell-checker, especially since to make any decent guess it has to compare against every item in the list. But it's a good start I guess and hopefully will show you how hard your little spell-checker is working to correct your mistakes :)

Comments