Project: jarowinkler

library for fast approximate string matching using Jaro and Jaro-Winkler similarity

Project Details

Latest version: 2.0.1
Home Page: https://github.com/maxbachmann/JaroWinkler
PyPI Page: https://pypi.org/project/jarowinkler/

Project Popularity

PageRank: 0.0017748222873477622
Number of downloads: 107817

JaroWinkler

JaroWinkler is a library to calculate the Jaro and Jaro-Winkler similarity. It is easy to use, significantly faster than alternatives such as jellyfish and python-Levenshtein, and designed to integrate seamlessly with RapidFuzz.

⚡ Quickstart

>>> from jarowinkler import *

>>> jaro_similarity("Johnathan", "Jonathan")
0.8796296296296297

>>> jarowinkler_similarity("Johnathan", "Jonathan")
0.9037037037037037
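The two scores above differ only by the Winkler prefix adjustment. The sketch below follows the textbook definition, assuming the commonly used conventions (a shared prefix of at most four elements, a scaling factor of 0.1, and a boost applied only when the Jaro score exceeds 0.7). The name winkler_boost and its parameters are illustrative and are not part of the library's interface.

```python
def winkler_boost(jaro_score, s1, s2, prefix_weight=0.1, max_prefix=4,
                  boost_threshold=0.7):
    """Apply the textbook Winkler prefix adjustment to an existing Jaro score.

    All parameter names and defaults here are assumptions made for this sketch.
    """
    # Low scores are conventionally left unchanged.
    if jaro_score <= boost_threshold:
        return jaro_score

    # Count the length of the common prefix, capped at max_prefix.
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1

    # Move the score towards 1.0 in proportion to the shared prefix.
    return jaro_score + prefix * prefix_weight * (1.0 - jaro_score)


# "Johnathan" and "Jonathan" share the two-element prefix "Jo".
boosted = winkler_boost(0.8796296296296297, "Johnathan", "Jonathan")
```

With a shared prefix of length 2, the Jaro score of roughly 0.8796 is moved about a fifth of the way towards 1.0, which matches the Jaro-Winkler score shown in the quickstart.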

🚀 Benchmarks

The implementation is based on a novel approach to calculating the Jaro-Winkler similarity using bit-parallelism. This is significantly faster than the original approach used in other libraries. The following benchmark compares its performance with jellyfish and python-Levenshtein.

[Figure: Benchmark JaroWinkler]
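To give a rough idea of what a bit-parallel formulation can look like, here is a minimal pure-Python sketch. It is not the library's actual implementation (which is written in C++), and the function name jaro_bitmask is hypothetical. It stores the positions of every element of the first sequence in an integer bitmask, so that "find the first unmatched occurrence inside the match window" becomes a few bitwise operations instead of a scan.

```python
def jaro_bitmask(s1, s2):
    """Jaro similarity using integer bitmasks for match finding (illustrative sketch)."""
    if not s1 and not s2:
        return 1.0
    if not s1 or not s2:
        return 0.0

    # Elements may only match if their positions differ by at most `bound`.
    bound = max(max(len(s1), len(s2)) // 2 - 1, 0)

    # One bitmask per distinct element: bit i is set if s1[i] is that element.
    masks = {}
    for i, item in enumerate(s1):
        masks[item] = masks.get(item, 0) | (1 << i)

    flagged1 = 0  # matched positions in s1
    flagged2 = 0  # matched positions in s2
    for j, item in enumerate(s2):
        lo = max(0, j - bound)
        hi = min(len(s1) - 1, j + bound)
        # Bits lo..hi set; evaluates to 0 when the window is empty.
        window = ((1 << (hi + 1)) - 1) & ~((1 << lo) - 1)
        candidates = masks.get(item, 0) & window & ~flagged1
        if candidates:
            flagged1 |= candidates & -candidates  # keep only the lowest set bit
            flagged2 |= 1 << j

    matches = bin(flagged1).count("1")
    if matches == 0:
        return 0.0

    # Matched elements that appear in a different order count as half a transposition.
    seq1 = [s1[i] for i in range(len(s1)) if (flagged1 >> i) & 1]
    seq2 = [s2[j] for j in range(len(s2)) if (flagged2 >> j) & 1]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3.0


score = jaro_bitmask("Johnathan", "Jonathan")
```

Python integers are arbitrarily large, so this sketch does not need to split long inputs into machine words; a native implementation would have to handle that explicitly.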

⚙️ Installation

You can install this library from PyPI with pip:

pip install jarowinkler

JaroWinkler provides binary wheels for all common platforms.

Source builds

For a source build (for example, from an sdist package) you only need a C++14-compatible compiler. You can also install directly from GitHub:

pip install git+https://github.com/maxbachmann/JaroWinkler.git@main

📖 Usage

The algorithms in JaroWinkler can be used not only with strings, but with arbitrary sequences of hashable objects:

from jarowinkler import jarowinkler_similarity


jarowinkler_similarity("this is an example".split(), ["this", "is", "a", "example"])
# 0.8666666666666667

As long as two objects have the same hash, they are treated as the same element. You can provide a __hash__ method for your own object instances:

class MyObject:
    def __init__(self, hash_value):
        # Avoid shadowing the built-in hash() with the parameter name.
        self.hash_value = hash_value

    def __hash__(self):
        return self.hash_value

jarowinkler_similarity([MyObject(1), MyObject(2)], [MyObject(1), MyObject(2), MyObject(3)])
# 0.9111111111111111
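If writing __hash__ by hand is not desirable, a frozen dataclass is one standard-library way to get a hash derived from an object's fields. The sketch below only demonstrates the hashing behaviour; the Token class and its fields are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical value object; frozen=True makes instances hashable by their fields."""
    text: str
    lang: str = "en"


a = Token("example")
b = Token("example")
c = Token("sample")

# Instances with equal fields compare equal and therefore share a hash.
same_hash = hash(a) == hash(b)
```

Sequences of such instances could then be passed wherever a sequence of hashable objects is accepted.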

All algorithms provide a score_cutoff parameter, which can be used to filter out bad matches: any score below the cutoff is returned as 0.0. Internally, this allows JaroWinkler to select faster implementations in some places:

jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.9)
# 0.0

jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.85)
# 0.8796296296296297
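One way a cutoff can save work, shown here as an assumption rather than a description of what the library actually does: the Jaro similarity is bounded by the lengths of the two inputs alone, so a pair whose best possible score is already below the cutoff can be rejected without comparing any elements. The function names below are hypothetical.

```python
def jaro_upper_bound(len1, len2):
    """Best Jaro score achievable for two sequences of the given lengths."""
    if len1 == 0 and len2 == 0:
        return 1.0
    if len1 == 0 or len2 == 0:
        return 0.0
    # At most min(len1, len2) elements can match, with no transpositions.
    max_matches = min(len1, len2)
    return (max_matches / len1 + max_matches / len2 + 1.0) / 3.0


def can_skip(len1, len2, score_cutoff):
    """True if no pair of sequences with these lengths can reach score_cutoff."""
    return jaro_upper_bound(len1, len2) < score_cutoff


# A 3-element and a 12-element sequence can score at most 0.75.
skip_short_vs_long = can_skip(3, 12, score_cutoff=0.9)
# Lengths 9 and 8 can still reach about 0.963, so the pair must be compared.
skip_similar_lengths = can_skip(9, 8, score_cutoff=0.9)
```

This bound applies to the plain Jaro score; the Winkler prefix adjustment can raise a score further, so a Jaro-Winkler filter would need a correspondingly looser bound.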

JaroWinkler can be used with RapidFuzz, which provides multiple methods to compute string metrics on collections of inputs. JaroWinkler implements the RapidFuzz C-API, which allows RapidFuzz to call the functions without any of the usual overhead of Python, making this even faster.

from jarowinkler import jarowinkler_similarity
from rapidfuzz import process

process.cdist(["Johnathan", "Jonathan"], ["Johnathan", "Jonathan"], scorer=jarowinkler_similarity)
array([[1.       , 0.9037037],
       [0.9037037, 1.       ]], dtype=float32)
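Conceptually, the matrix above holds one score per (query, choice) pair. The following pure-Python sketch shows that shape with a deliberately trivial stand-in scorer; pairwise_scores and exact_match are hypothetical helpers, and the sketch has none of the speed advantages of calling the scorer through the C-API.

```python
def pairwise_scores(queries, choices, scorer):
    """Return a nested list where row i, column j is scorer(queries[i], choices[j])."""
    return [[scorer(query, choice) for choice in choices] for query in queries]


def exact_match(a, b):
    """Stand-in scorer for illustration only: 1.0 for identical inputs, else 0.0."""
    return 1.0 if a == b else 0.0


names = ["Johnathan", "Jonathan"]
matrix = pairwise_scores(names, names, exact_match)
```

Any callable that takes two sequences and returns a float could be passed as the scorer in this sketch.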

👍 Contributing

PRs are welcome!

Found a bug? Report it as an issue or, even better, fix it!
Can make something faster? Great! Just avoid external dependencies and remember that existing functionality should still work.
Something else that you think is good? Do it! Just make sure that CI passes and everything from the README is still applicable (interface, features, and so on).
Have no time to code? Tell your friends and subscribers about JaroWinkler. More users, more contributions, more amazing features.

Thank you ❤️