Chapter 14 — Simil: an algorithm to look for similar strings

As it turns out, I’m looking for a Simil of “vacation” or “relaxation”, and my schedule (and life) is only allowing for a three-character match on “ion”, for which the results are “fascination” and “illusion”. Now, if I were wiser I would have set up some “automation”, and perhaps my Simil result would be different. Instead I’m going to be living a life of “denormalization” with a touch of “delusion” thrown in to keep me sane.

This is a fun chapter, as it takes a traditional problem (wildcard searching) and provides a unique answer that I’d not heard of before. Tom van Stiphout does a great job of explaining the various ways to search for values. He covers Soundex, Equals, Like, Contains, FreeText and then Simil. The chapter culminates in the creation of a function that uses a .NET assembly via CLR.
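For anyone curious what a Simil-style score might look like in code, here is a minimal Python sketch. It is not Tom's code and not the CLR assembly from the book. It assumes the score is built by finding the longest common substring, recursing on the pieces to the left and right of it, and dividing twice the number of matched characters by the combined length of both strings. The function names (`longest_common_substring`, `matched_chars`, `simil`) are my own, and the case-insensitive comparison is also an assumption.

```python
def longest_common_substring(a, b):
    """Return (start_in_a, start_in_b, length) of the longest common substring."""
    best = (0, 0, 0)
    # prev[j] holds the length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


def matched_chars(a, b):
    """Count matched characters: longest common run, then recurse left and right."""
    start_a, start_b, length = longest_common_substring(a, b)
    if length == 0:
        return 0
    left = matched_chars(a[:start_a], b[:start_b])
    right = matched_chars(a[start_a + length:], b[start_b + length:])
    return length + left + right


def simil(a, b):
    """Similarity between 0.0 (nothing in common) and 1.0 (identical)."""
    a, b = a.lower(), b.lower()  # assumption: comparison ignores case
    if not a and not b:
        return 1.0
    return 2.0 * matched_chars(a, b) / (len(a) + len(b))


print(round(simil("vacation", "relaxation"), 2))
```

Under this sketch, “vacation” and “relaxation” share “ation” plus one more “a”, so six characters match out of eighteen in total and the score comes out at about 0.67.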

To top it all off, he even offers ALL of the software you need to get up and running with this solution (provided that you have purchased the book). If you or your company struggle with a lot of misspellings that result in data cleanup, this might very well be worth looking into.

Because it relies on an LCS algorithm, this method may not perform adequately across very large data sets, but it would excel as a check in your front end prior to persisting the data in a database. Additionally, different thresholds will likely need to be defined in order to make Simil a reality in your databases. For instance, the names of pharmaceutical drugs will likely require a different match percentage than common household cleaners. This is largely because pharmaceutical drug names tend to share more letters in sequence than the names of household cleaners do. There are exceptions to every rule, and no “fuzzy” matching logic can correct every spelling error in proper names or in words that cannot be found on webster.com, but that in no way diminishes the value that Tom provided with this chapter and the accompanying code.
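To make the threshold idea concrete, here is a small Python sketch of a front-end check that uses a different cutoff per category. Everything in it is an assumption for illustration: the category names, the threshold values (which would need tuning against real data) and the `near_matches` helper are hypothetical, and Python's standard `difflib.SequenceMatcher` stands in for Simil as a similar ratio-based score rather than the function from the book.

```python
from difflib import SequenceMatcher

# Hypothetical cutoffs, illustrative only; tune them against your own data.
THRESHOLDS = {
    "pharmaceutical": 0.90,  # stricter, on the assumption that names are close together
    "household": 0.75,       # looser, on the assumption that names differ more
}


def similarity(a, b):
    """Stand-in for Simil: difflib's ratio, compared case-insensitively."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def near_matches(candidate, known_values, category):
    """Return (score, value) pairs at or above the category cutoff, best first."""
    cutoff = THRESHOLDS[category]
    scored = [(similarity(candidate, known), known) for known in known_values]
    return sorted((pair for pair in scored if pair[0] >= cutoff), reverse=True)


known = ["bleach", "ammonia"]
print(near_matches("bleech", known, "household"))
print(near_matches("bleech", known, "pharmaceutical"))
```

With these made-up cutoffs, the misspelling “bleech” is offered “bleach” as a suggestion under the looser household threshold but returns nothing under the stricter pharmaceutical one, which is the kind of per-category tuning described above.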

I intend to work with this code in the weeks to come… it’s exciting and fun to provide additional value for my clients, and I trust that if you spend some time working on this in your local dev environment, you can also significantly improve the accuracy, integrity and value of your company’s data.

Thanks for a great chapter, Tom! As a result, I have one more miss“ion” to accomplish before my vacat“ion” plans will come to fruit“ion”.