If you are working on a service that needs to aggregate place names from different sources, for example Google Places, Foursquare and a local database, you will always face the problem of showing the same place more than once. Imagine you have Paolo Pizza at Backer Street 43 and all three data sources return it in the search results: you need to show only one entry, taken from the preferred data source.
My solution was a two-step check:
- The distance between the places is less than 200 meters
- Their names are similar
If both conditions hold, the two records are treated as the same place.
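As a rough illustration, the two-step check can be wrapped in a small deduplication loop. This is only a sketch: the record layout (dicts with `name` and `source` keys), the function name `deduplicate` and the `distance_m` / `names_similar` callbacks are all hypothetical, and the loop assumes that a place from a more preferred source should win over a later duplicate.

```python
def deduplicate(places, distance_m, names_similar, source_priority, max_distance_m=200):
    """Keep one record per real-world place, preferring earlier sources.

    places          -- list of dicts with at least "name" and "source" keys
    distance_m      -- callback(place_a, place_b) -> distance in meters
    names_similar   -- callback(name_a, name_b) -> bool
    source_priority -- list of source names, most preferred first
    """
    # Visit preferred sources first so their records are the ones kept.
    ranked = sorted(places, key=lambda p: source_priority.index(p["source"]))
    kept = []
    for place in ranked:
        is_duplicate = any(
            # Step 1: close enough. Step 2: names look alike.
            distance_m(place, other) < max_distance_m
            and names_similar(place["name"], other["name"])
            for other in kept
        )
        if not is_duplicate:
            kept.append(place)
    return kept
```

Passing the distance and name checks in as callbacks keeps the loop independent of which distance formula or string metric is used.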
The first step is straightforward: I have the latitude and longitude of each place, so I can calculate the distance with the simple Haversine formula.
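For reference, here is a minimal sketch of the Haversine distance. The function name `haversine_m` and the mean Earth radius of 6,371 km are my own choices for the illustration, not something taken from the original code.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed value)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def is_nearby(lat1, lon1, lat2, lon2, max_distance_m=200):
    """Step 1 of the check: are the two places less than 200 meters apart?"""
    return haversine_m(lat1, lon1, lat2, lon2) < max_distance_m
```

At these distances the spherical approximation is more than accurate enough; the 200 meter limit is far coarser than the error of the formula.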
The most interesting part was finding the best way to compare place names. After some googling I settled on three algorithms, JaroWinkler, SmithWatermanGotoh and Soundex, and on a library that supports all of them: simmetrics. I ran a test to find the best algorithm for my needs.
Just to explain: I want an easy way to see that Antonio Cafe is the same as Antonio’s Cafe and Antonio Kafe, but differs from Antonio hotel and from Mary Coffee.
The Soundex algorithm turned out to be the most appropriate for my needs. In code I use a threshold of 0.98 when comparing two words.
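To show why Soundex fits the Antonio examples, here is a standalone sketch that does not use simmetrics. It rests on several assumptions of mine: the whole name is encoded as one string, the code is 6 characters long, and the two codes are compared with Python's `difflib.SequenceMatcher` as a stand-in for whatever string metric the library applies to them. The names `soundex`, `name_similarity` and `same_name` are hypothetical.

```python
from difflib import SequenceMatcher

# Classic Soundex consonant groups; vowels, H, W and Y carry no digit.
_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(text, length=6):
    """Soundex code of the whole string, ignoring anything but A-Z."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        # H and W do not break a run of equal digits; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "0" * length)[:length]


def name_similarity(a, b):
    """Similarity in [0, 1] between the Soundex codes of two names."""
    code_a, code_b = soundex(a), soundex(b)
    if not code_a or not code_b:
        return 0.0
    return SequenceMatcher(None, code_a, code_b).ratio()


def same_name(a, b, threshold=0.98):
    """Step 2 of the check, using the 0.98 threshold from the text."""
    return name_similarity(a, b) >= threshold


for other in ("Antonio’s Cafe", "Antonio Kafe", "Antonio hotel", "Mary Coffee"):
    print(other, soundex(other), same_name("Antonio Cafe", other))
```

Under these assumptions the first three Antonio names all encode to `A53521`, while Antonio hotel gives `A53534` and Mary Coffee gives `M62100`. With codes this short, a 0.98 threshold in this sketch effectively requires the codes to be identical.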
That’s our way of solving this problem. If you have other experience with similar problems, please post it in the comments.