library for fast approximate string matching using Jaro and Jaro-Winkler similarity
>>> from jarowinkler import *
>>> jaro_similarity("Johnathan", "Jonathan")
0.8796296296296297
>>> jarowinkler_similarity("Johnathan", "Jonathan")
0.9037037037037037
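To make the two scores above concrete, here is a minimal pure-Python sketch of the textbook Jaro and Jaro-Winkler definitions. It is the straightforward quadratic formulation, not the bit-parallel implementation this library uses. The helper names `jaro_reference` and `jaro_winkler_reference`, the prefix weight of 0.1, the prefix limit of 4, and the 0.7 boost threshold are assumptions taken from the common textbook definition.

```python
def jaro_reference(s1, s2):
    """Textbook Jaro similarity for two sequences (hypothetical helper)."""
    # Empty-input conventions differ between implementations;
    # this sketch simply returns 0.0 if either input is empty.
    if not s1 or not s2:
        return 0.0

    # Elements only count as matching if they are at most `window` positions apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0

    for i, item in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == item:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Matched elements that appear in a different order count as half a transposition each.
    seq1 = [x for x, m in zip(s1, matched1) if m]
    seq2 = [x for x, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len(s1)
        + matches / len(s2)
        + (matches - transpositions) / matches
    ) / 3.0


def jaro_winkler_reference(s1, s2, prefix_weight=0.1):
    """Jaro similarity boosted by the length of the common prefix (at most 4)."""
    sim = jaro_reference(s1, s2)
    if sim <= 0.7:  # assumed boost threshold from Winkler's definition
        return sim
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * prefix_weight * (1.0 - sim)
```

For "Johnathan" and "Jonathan" this finds 8 matches and 2 transpositions, giving (8/9 + 8/8 + 6/8) / 3 for Jaro, and the shared prefix "Jo" adds 2 * 0.1 * (1 - Jaro) for Jaro-Winkler.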
The implementation is based on a novel approach that calculates the Jaro-Winkler similarity using bit-parallelism. This is significantly faster than the original approach used in other libraries. The project's benchmark compares its performance with jellyfish and python-Levenshtein.
You can install this library from PyPI with pip:
pip install jarowinkler
JaroWinkler provides binary wheels for all common platforms.
For a source build (for example from an sdist package), you only need a C++14-compatible compiler. You can also install directly from GitHub:
pip install git+https://github.com/maxbachmann/JaroWinkler.git@main
All algorithms in JaroWinkler work not only with strings, but with arbitrary sequences of hashable objects:
from jarowinkler import jarowinkler_similarity
jarowinkler_similarity("this is an example".split(), ["this", "is", "a", "example"])
# 0.8666666666666667
As long as two objects have the same hash, they are treated as the same element. You can provide a __hash__ method for your own object instances.
class MyObject:
    def __init__(self, hash):
        self.hash = hash

    def __hash__(self):
        return self.hash
jarowinkler_similarity([MyObject(1), MyObject(2)], [MyObject(1), MyObject(2), MyObject(3)])
# 0.9111111111111111
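One practical use of this rule is to make elements compare by a normalized key. The sketch below is illustrative only: `CaseInsensitiveToken` and `to_hash_sequence` are hypothetical names, and `to_hash_sequence` merely shows which elements would end up with equal hashes. It is not a description of how the library processes its input internally.

```python
class CaseInsensitiveToken:
    """Hypothetical wrapper whose hash ignores letter case."""

    def __init__(self, text):
        self.text = text

    def __hash__(self):
        # Tokens that differ only in case share a hash.
        return hash(self.text.lower())


def to_hash_sequence(seq):
    """Reduce every element to its hash, to inspect which ones coincide."""
    return [hash(item) for item in seq]


left = [CaseInsensitiveToken(w) for w in "This is an Example".split()]
right = [CaseInsensitiveToken(w) for w in "this is an example".split()]
same_hashes = to_hash_sequence(left) == to_hash_sequence(right)
```

Under the rule stated above, passing `left` and `right` to `jarowinkler_similarity` would be expected to treat the differently cased words as the same elements.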
All algorithms provide a score_cutoff parameter. Any score below score_cutoff is returned as 0.0, which filters out bad matches. Internally, this also allows JaroWinkler to select faster implementations in some places:
jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.9)
# 0.0
jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.85)
# 0.8796296296296297
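The behavior in the two calls above can be sketched in a few lines, together with one reason a cutoff can save work. The helper names are hypothetical, and the length-based bound is only an illustration of the kind of shortcut a cutoff makes possible. It is not a claim about which optimizations JaroWinkler actually applies.

```python
def apply_score_cutoff(score, score_cutoff=None):
    """Report scores below the cutoff as 0.0, as in the examples above."""
    if score_cutoff is not None and score < score_cutoff:
        return 0.0
    return score


def jaro_upper_bound(len1, len2):
    """Best possible Jaro similarity for two sequences of the given lengths.

    At most min(len1, len2) elements can match, and the best case has
    no transpositions, so the third term of the Jaro formula is 1.
    """
    if len1 == 0 or len2 == 0:
        return 0.0
    m = min(len1, len2)
    return (m / len1 + m / len2 + 1.0) / 3.0


def can_skip(len1, len2, score_cutoff):
    """True if the pair cannot reach the cutoff, based on lengths alone."""
    return jaro_upper_bound(len1, len2) < score_cutoff
```

For example, a 2-element sequence compared with a 10-element sequence can score at most (1 + 0.2 + 1) / 3, so with `score_cutoff=0.9` the result is known to be 0.0 without comparing any elements.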
JaroWinkler can be used with RapidFuzz, which provides multiple methods to compute string metrics on collections of inputs. JaroWinkler implements the RapidFuzz C-API, which lets RapidFuzz call its functions without the usual Python call overhead and makes this even faster.
from jarowinkler import jarowinkler_similarity
from rapidfuzz import process
process.cdist(["Johnathan", "Jonathan"], ["Johnathan", "Jonathan"], scorer=jarowinkler_similarity)
# array([[1.       , 0.9037037],
#        [0.9037037, 1.       ]], dtype=float32)
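The layout of that result can be pictured with a plain-Python equivalent: row i holds the scores of the i-th query against every choice. This is only a sketch of the output shape. `pairwise_scores` and the stand-in scorer `exact_match` are hypothetical names, and the real `process.cdist` returns a NumPy array and avoids this per-call Python overhead.

```python
def pairwise_scores(queries, choices, scorer):
    """Score every query against every choice; result[i][j] = scorer(queries[i], choices[j])."""
    return [[scorer(query, choice) for choice in choices] for query in queries]


def exact_match(a, b):
    """Stand-in scorer for illustration: 1.0 for equal inputs, else 0.0."""
    return 1.0 if a == b else 0.0


matrix = pairwise_scores(
    ["Johnathan", "Jonathan"],
    ["Johnathan", "Jonathan"],
    exact_match,
)
```

Any callable that takes two sequences and returns a number can be passed as `scorer` in this sketch.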
PRs are welcome!
Thank you :heart:
Copyright 2021 - present maxbachmann. JaroWinkler is free and open-source software licensed under the MIT License.