Finding equal strings in Java is very easy, as is finding substrings. But what if you have strings that are only roughly similar, at least from a human point of view? For example: "John Doe" and "Doo Jonh".
Recently, I had to do a data migration for a table with very poor data quality. There is a lot of theory and there are many books on finding duplicates, but surprisingly few actual Java implementations that help with such things. Fortunately, there is at least one good one, Simmetrics. There is even an example of how to use it, and it is as simple as one can imagine (shown here as a Groovy script):
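// Groovy script; needs the Simmetrics jar on the classpath
import uk.ac.shef.wit.simmetrics.similaritymetrics.*
AbstractStringMetric metric = new MongeElkan()
// Print the candidate that is most similar to 'John Doe'
println(['John John', 'Doo Jonh', 'ddddddd', '7777777777'].max { metric.getSimilarity('John Doe', it) })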
Doo Jonh
There are a number of different algorithms to choose from, but I found MongeElkan to be the best for my case.
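To give a feel for why a token-based metric like Monge-Elkan copes with swapped and misspelled words, here is a rough, self-contained Java sketch of the idea: for every word in the first string, take its best match among the words of the second string, then average those best scores. This is not the Simmetrics implementation. The class name `MongeElkanSketch` and the use of normalized Levenshtein distance as the per-word measure are my own assumptions for illustration, so the scores (and possibly the ranking of candidates) will not match what the library returns.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; NOT the Simmetrics MongeElkan class.
public class MongeElkanSketch {

    // Classic dynamic-programming Levenshtein edit distance (two-row variant).
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed inner measure: similarity in [0, 1], where 1 means identical words.
    public static double tokenSimilarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }

    // Monge-Elkan idea: best match in b for each word of a, averaged over a's words.
    public static double similarity(String a, String b) {
        String[] wordsA = a.trim().split("\\s+");
        String[] wordsB = b.trim().split("\\s+");
        double sum = 0;
        for (String x : wordsA) {
            double best = 0;
            for (String y : wordsB) {
                best = Math.max(best, tokenSimilarity(x, y));
            }
            sum += best;
        }
        return sum / wordsA.length;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("Doe John", "Doo Jonh", "ddddddd", "7777777777");
        for (String candidate : candidates) {
            System.out.printf("%-12s %.3f%n", candidate, similarity("John Doe", candidate));
        }
    }
}
```

Because each word is matched independently, "Doe John" scores a perfect 1.0 against "John Doe" in this sketch, and a misspelled variant such as "Doo Jonh" still scores well above unrelated junk. Note that the measure is not symmetric: it averages over the words of the first argument only.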