Channel: What is the best algorithm for matching two string containing less than 10 words in latin script - Stack Overflow

What is the best algorithm for matching two strings containing fewer than 10 words in Latin script


I'm comparing song titles, mostly (though not always) in Latin script. My aim is an algorithm that gives a high score if the two song titles appear to be the same title and a very low score if they have nothing in common.

I already have Java code that does this using Lucene and a RAMDirectory. However, using Lucene simply to compare two strings is too heavyweight and consequently too slow. I've now moved to https://github.com/nickmancol/simmetrics, which has many nice algorithms for comparing two strings:

https://github.com/nickmancol/simmetrics/tree/master/src/main/java/uk/ac/shef/wit/simmetrics/similaritymetrics

BlockDistance, ChapmanLengthDeviation, ChapmanMatchingSoundex, ChapmanMeanLength, ChapmanOrderedNameCompoundSimilarity, CosineSimilarity, DiceSimilarity, EuclideanDistance, InterfaceStringMetric, JaccardSimilarity, Jaro, JaroWinkler, Levenshtein, MatchingCoefficient, MongeElkan, NeedlemanWunch, OverlapCoefficient, QGramsDistance, SmithWaterman, SmithWatermanGotoh, SmithWatermanGotohWindowedAffine, Soundex

However, I'm not well versed in these algorithms. Which would be a good choice?

I think Lucene uses CosineSimilarity in some form, so that is my starting point, but I suspect there might be something better.
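To make that starting point concrete, here is a minimal sketch of cosine similarity over word counts, written in plain Java with no library dependency. It is not Lucene's scoring and not the simmetrics CosineSimilarity class; the class name TokenCosine and its methods are hypothetical, and splitting on whitespace after lower-casing is an assumption about how titles should be tokenized.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene or simmetrics.
final class TokenCosine {

    // Lower-case the title and count how often each whitespace-separated word occurs.
    static Map<String, Integer> termCounts(String title) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : title.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a word-count vector.
    static double norm(Map<String, Integer> counts) {
        double sum = 0.0;
        for (int c : counts.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between the two word-count vectors:
    // close to 1.0 for the same bag of words, 0.0 when no word is shared.
    static double similarity(String a, String b) {
        Map<String, Integer> ca = termCounts(a);
        Map<String, Integer> cb = termCounts(b);
        if (ca.isEmpty() || cb.isEmpty()) {
            return 0.0;
        }
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            dot += e.getValue() * cb.getOrDefault(e.getKey(), 0);
        }
        return dot / (norm(ca) * norm(cb));
    }
}
```

With this sketch, TokenCosine.similarity("Let It Be", "Let It Be Me") comes out around 0.87, while two titles that share no word score 0.0. Note that a single misspelled word counts as a completely different word, which is one reason something better may be needed for titles.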

Specifically, the algorithm should work on short strings and should understand the concept of words, i.e. spaces should be treated specially. Good matching of Latin script is most important. Good matching of other scripts such as Korean and Chinese is relevant as well, but I expect it would need a different algorithm because of the way those scripts treat spaces.
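One way to read these requirements is a two-level comparison: split each title into words on whitespace, compare words with a character-level metric, then combine the per-word scores. The sketch below does this with a normalized Levenshtein distance per word and a best-match average per title, in the spirit of Monge-Elkan. It is an illustration under assumptions, not the simmetrics implementation: the class name WordAwareSimilarity, the choice of Levenshtein as the inner metric, and taking the minimum of the two directions are all hypothetical design choices. It also assumes whitespace-delimited words, so it would not suit Chinese text as written.

```java
import java.util.Locale;

// Hypothetical helper, not part of simmetrics.
final class WordAwareSimilarity {

    // Levenshtein edit distance via dynamic programming, keeping only two rows.
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Edit distance scaled to [0, 1]; 1.0 means the two words are identical.
    static double wordSimilarity(String s, String t) {
        int max = Math.max(s.length(), t.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(s, t) / max;
    }

    // Lower-case and split on whitespace (assumption: words are space-delimited).
    static String[] words(String title) {
        String cleaned = title.toLowerCase(Locale.ROOT).trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    // Average, over the words of a, of the best match found among the words of b.
    static double directed(String[] a, String[] b) {
        if (a.length == 0 || b.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (String wa : a) {
            double best = 0.0;
            for (String wb : b) {
                best = Math.max(best, wordSimilarity(wa, wb));
            }
            sum += best;
        }
        return sum / a.length;
    }

    // Symmetric score: the lower of the two directions, so unmatched extra words reduce the score.
    static double similarity(String a, String b) {
        String[] wa = words(a);
        String[] wb = words(b);
        return Math.min(directed(wa, wb), directed(wb, wa));
    }
}
```

Under these assumptions, WordAwareSimilarity.similarity("Hey Jude", "Jude Hey") is 1.0 because word order is ignored, a one-letter typo in a long word lowers the score only slightly, and unrelated titles score low. Whether ignoring word order is desirable for song titles is a judgment call.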

