Friday, September 28, 2012

Finding similar strings in Java

Finding equal strings in Java is very easy, and so is finding substrings. But what if you have strings that are almost the same, at least from a human point of view? For example: "John Doe" and "Doo Jonh".
Recently, I had to do a data migration for a table with very poor data quality. There is a lot of theory and there are many books on finding duplicates, but surprisingly, there are very few actual Java implementations that help with such things. Fortunately, there is at least one good one: Simmetrics. There is even an example of how to use it, and it is as simple as one can imagine (shown here in Groovy):


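import uk.ac.shef.wit.simmetrics.similaritymetrics.*
// Pick the candidate that scores highest against 'John Doe'
AbstractStringMetric metric = new MongeElkan()
println(['John John', 'Doo Jonh', 'ddddddd', '7777777777'].max { metric.getSimilarity('John Doe', it) })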

Doo Jonh

There are a number of different algorithms one can use, but I found MongeElkan to be the best for my case.
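To give a feel for why a Monge-Elkan style metric copes with swapped and misspelled words, here is a minimal plain-Java sketch of the idea: split both strings into tokens, match each token of the first string to its best counterpart in the second, and average those best scores. This is not the Simmetrics implementation. The class name `MongeElkanSketch`, its method names, and the choice of inner token metric (a normalized edit distance that counts an adjacent swap as one edit) are all assumptions made for illustration, so the scores will differ from what the library returns.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: not the Simmetrics code, names are hypothetical.
class MongeElkanSketch {

    // Edit distance where insert, delete, substitute and swapping two
    // adjacent characters each cost 1 (optimal string alignment distance).
    public static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                // Adjacent transposition, e.g. "hn" versus "nh"
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    // Token similarity in [0, 1]; 1.0 means the tokens are identical.
    public static double tokenSimilarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / maxLen;
    }

    // For each token of a, take its best match among the tokens of b,
    // then average the best scores. Note that this is not symmetric.
    public static double similarity(String a, String b) {
        String[] tokensA = a.trim().split("\\s+");
        String[] tokensB = b.trim().split("\\s+");
        double sum = 0.0;
        for (String ta : tokensA) {
            double best = 0.0;
            for (String tb : tokensB) {
                best = Math.max(best, tokenSimilarity(ta, tb));
            }
            sum += best;
        }
        return sum / tokensA.length;
    }

    // Returns the candidate with the highest similarity to target.
    public static String mostSimilar(String target, List<String> candidates) {
        return candidates.stream()
                .max(Comparator.comparingDouble((String c) -> similarity(target, c)))
                .orElseThrow(() -> new IllegalArgumentException("no candidates"));
    }

    public static void main(String[] args) {
        List<String> candidates =
                Arrays.asList("John John", "Doo Jonh", "ddddddd", "7777777777");
        System.out.println(mostSimilar("John Doe", candidates)); // prints "Doo Jonh"
    }
}
```

Because every token is matched independently, word order does not matter, and a small typo only lowers the score of the one token it affects. The result does depend on the inner token metric: with a plain Levenshtein distance, where the "hn"/"nh" swap counts as two edits, this sketch would rank "John John" first instead.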

