07 March 2017

question Find the similarity percent between two strings

  1. question

     similar("Apple","Appel") => 80%
     similar("Apple","Mango") =>  0%
    
  2. answer

     from difflib import SequenceMatcher
    
     def similar(a, b):
         return SequenceMatcher(None, a, b).ratio()
    
     >>> similar("Apple","Appel")
     0.8
     >>> similar("Apple","Mango")
     0.0
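
The question asks for a percentage while `ratio()` returns a float in [0, 1]. A minimal sketch of the conversion follows; the helper name `similar_percent` is made up for this note, and rounding to a whole number is an assumption about the desired output.

```python
from difflib import SequenceMatcher

def similar_percent(a, b):
    """Return the SequenceMatcher ratio of a and b as a whole-number percentage."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)

print(similar_percent("Apple", "Appel"))  # 80
print(similar_percent("Apple", "Mango"))  # 0
```

The comparison is case-sensitive, which is why "Apple" and "Mango" share no characters (`A` and `a` differ).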
    
  3. reference Fuzzy string comparison in Python, confused with which library to use [closed]

    1. question

       import Levenshtein
       Levenshtein.ratio('hello world', 'hello')
      
       Result: 0.625
      
       import difflib
       difflib.SequenceMatcher(None, 'hello world', 'hello').ratio()
      
       Result: 0.625
      
    2. answer

       difflib.SequenceMatcher => Ratcliff/Obershelp algorithm
       Levenshtein             => Levenshtein algorithm
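
To make the difference concrete, here is a rough sketch of the two quantities the algorithms are built on: the classic edit distance (hand-rolled here, not the `Levenshtein` C implementation) and the number of matched characters that `difflib` reports. The `difflib` docs define `ratio()` as `2.0 * M / T`, with `M` the matched characters and `T` the combined length. The function names are illustrative only.

```python
from difflib import SequenceMatcher

def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute cost 1 each)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def matched_characters(a, b):
    """Total size of the matching blocks found by SequenceMatcher (the M in 2*M/T)."""
    matcher = SequenceMatcher(None, a, b)
    return sum(block.size for block in matcher.get_matching_blocks())

a, b = 'hello world', 'hello'
print(levenshtein_distance(a, b))                       # 6 (six deletions)
print(matched_characters(a, b))                         # 5
print(2.0 * matched_characters(a, b) / (len(a) + len(b)))  # 0.625
```

For this pair the only edits are deletions, so both views of the data lead to the same 0.625; strings that need substitutions or reordering are where the two libraries can start to disagree.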
      
  4. FuzzyWuzzy: Fuzzy String Matching in Python

    1. string similarity

       from difflib import SequenceMatcher
       m = SequenceMatcher(None, 'new york mets', 'new york meats')
       m.ratio() => 0.9629...
      
       fuzz.ratio('new york mets', 'new york meats') => 96
      
    2. partial string similarity

       fuzz.ratio('yankees', 'new york yankees')       => 60
       fuzz.ratio('new york mets', 'new york yankees') => 75
      
       fuzz.partial_ratio('yankees', 'new york yankees')       => 100
       fuzz.partial_ratio('new york mets', 'new york yankees') => 69
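
The idea behind partial matching can be sketched with `difflib` alone: compare the shorter string against every window of the same length in the longer string and keep the best score. This is a simplified illustration, not FuzzyWuzzy's actual implementation, and `partial_ratio_sketch` is a made-up name.

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1, s2):
    """Best ratio between the shorter string and any equal-length window of the longer one."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    best = 0.0
    # Slide a window of len(shorter) characters across the longer string.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(best * 100)

print(partial_ratio_sketch('yankees', 'new york yankees'))        # 100
print(partial_ratio_sketch('new york mets', 'new york yankees'))  # 69
```

The brute-force window scan is quadratic in the worst case; it is meant to show the idea, not to be fast.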
      
    3. out of order

       fuzz.ratio('new york mets vs atlanta braves', 'atlanta braves vs new york mets')          => 45
       fuzz.partial_ratio('new york mets vs atlanta braves', 'atlanta braves vs new york mets') => 45
      
       # token sort
       'new york mets vs atlanta braves' --> 'atlanta braves mets new vs york'
       fuzz.token_sort_ratio('new york mets vs atlanta braves', 'atlanta braves vs new york mets') => 100
      
       # token set
       s1 = 'mariners vs angels'
       s2 = 'los angeles angels of anaheim at seattle mariners'
       # after sort
       t1 = 'angels mariners vs'
       t2 = 'anaheim angeles angels at los mariners of seattle'
       fuzz.token_set_ratio('mariners vs angels', 'los angeles angels of anaheim at seattle mariners') => 90
      
       fuzz.token_set_ratio('sirhan, sirhan', 'sirhan') => 100
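
Both token approaches can be approximated with the standard library. The sketch below assumes whitespace tokenization only (no lowercasing or punctuation stripping) and rounds scores, so results may differ by a point from the FuzzyWuzzy figures quoted above. All function names are hypothetical.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """SequenceMatcher ratio scaled to 0..100."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)

def sort_tokens(s):
    """Split on whitespace and rejoin the tokens in alphabetical order."""
    return ' '.join(sorted(s.split()))

def token_sort_ratio_sketch(s1, s2):
    """Compare the strings after sorting their tokens, so word order is ignored."""
    return ratio(sort_tokens(s1), sort_tokens(s2))

def token_set_ratio_sketch(s1, s2):
    """Compare the shared tokens against each string's shared-plus-unique tokens."""
    tokens1, tokens2 = set(s1.split()), set(s2.split())
    common = ' '.join(sorted(tokens1 & tokens2))
    t1 = (common + ' ' + ' '.join(sorted(tokens1 - tokens2))).strip()
    t2 = (common + ' ' + ' '.join(sorted(tokens2 - tokens1))).strip()
    # Taking the best of the three pairings rewards a large shared core.
    return max(ratio(common, t1), ratio(common, t2), ratio(t1, t2))

print(token_sort_ratio_sketch('new york mets vs atlanta braves',
                              'atlanta braves vs new york mets'))  # 100
print(token_set_ratio_sketch('sirhan, sirhan', 'sirhan'))          # 100
```

Because duplicates collapse in a set and the shared tokens are compared on their own, a string that is fully contained in the other (token-wise) scores 100 regardless of how much extra text surrounds it.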
      

references

  1. distance

    1. Hamming distance

    2. Levenshtein distance

    3. Damerau–Levenshtein distance

    4. Jaro–Winkler distance

  2. source code

    1. Levenshtein.c

    2. fuzzywuzzy

  3. doc

    1. difflib

