Project: fuzzysearch

fuzzysearch is useful for finding approximate subsequence matches.

Project Details

Latest version: 0.7.3
Home Page: https://github.com/taleinat/fuzzysearch
PyPI Page: https://pypi.org/project/fuzzysearch/

Project Popularity

PageRank: 0.0015528850023319469
Number of downloads: 87598

===========
fuzzysearch
===========

.. image:: https://img.shields.io/pypi/v/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Latest Version

.. image:: https://img.shields.io/travis/taleinat/fuzzysearch.svg?branch=master
    :target: https://travis-ci.org/taleinat/fuzzysearch/branches
    :alt: Build & Tests Status

.. image:: https://img.shields.io/coveralls/taleinat/fuzzysearch.svg?branch=master
    :target: https://coveralls.io/r/taleinat/fuzzysearch?branch=master
    :alt: Test Coverage

.. image:: https://img.shields.io/pypi/wheel/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Wheels

.. image:: https://img.shields.io/pypi/pyversions/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Supported Python versions

.. image:: https://img.shields.io/pypi/implementation/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Supported Python implementations

.. image:: https://img.shields.io/pypi/l/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch/
    :alt: License

Fuzzy search: Find parts of long text or data, allowing for some changes/typos.

Easy, fast, and just works!

.. code:: python

    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

* Two simple functions to use: one for in-memory data and one for files

  * Fastest search algorithm is chosen automatically

* Levenshtein Distance metric with configurable parameters

  * Separately configure the max. allowed distance, substitutions, deletions
    and/or insertions

* Advanced algorithms with optional C and Cython optimizations
* Properly handles Unicode; special optimizations for binary data
* Simple installation:

  * ``pip install fuzzysearch`` just works
  * pure-Python fallbacks for compiled modules
  * only one dependency (``attrs``)

* Extensively tested
* Free software: `MIT license <LICENSE>`_

For more info, see the `documentation <http://fuzzysearch.rtfd.org>`_.

Installation
------------

fuzzysearch supports Python versions 2.7 and 3.5+, as well as PyPy 2.7 and 3.6.

.. code::

    $ pip install fuzzysearch

Installation succeeds even if building the C and Cython extensions fails; fuzzysearch then uses its pure-Python fallbacks.
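The fallback behaviour described above follows a common packaging pattern: try to import a compiled module, and define a pure-Python equivalent if the import fails. The sketch below illustrates only that general pattern. The module name ``_hypothetical_fast_ext`` and the function ``count_mismatches`` are made up for illustration and are not part of fuzzysearch.

.. code:: python

    try:
        # Hypothetical compiled extension; importable only if it was built
        # successfully at install time.
        from _hypothetical_fast_ext import count_mismatches
    except ImportError:
        def count_mismatches(a, b):
            """Pure-Python fallback: count positions where two
            equal-length sequences differ."""
            if len(a) != len(b):
                raise ValueError("sequences must have equal length")
            return sum(1 for x, y in zip(a, b) if x != y)

    # Callers use the same name whichever implementation was loaded.
    print(count_mismatches("PATTERN", "PATTERM"))  # prints 1 with the fallback

Because both implementations are bound to the same name, calling code does not need to know which one is in use.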

Usage
-----

Just call ``find_near_matches()`` with the sub-sequence you're looking for, the sequence to search, and the matching parameters:

.. code:: python

    >>> from fuzzysearch import find_near_matches
    >>> # search for 'PATTERN' with a maximum Levenshtein Distance of 1
    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

To search in a file, use ``find_near_matches_in_file()`` similarly:

.. code:: python

    >>> from fuzzysearch import find_near_matches_in_file
    >>> with open('data_file', 'rb') as f:
    ...     find_near_matches_in_file(b'PATTERN', f, max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

Examples
--------

fuzzysearch is great for ad-hoc searches of genetic data, such as DNA or protein sequences, before reaching for "heavier", domain-specific tools like BioPython:

.. code:: python

    >>> sequence = '''\
    GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
    TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
    CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
    GGGATAGG'''
    >>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
    >>> find_near_matches(subsequence, sequence, max_l_dist=2)
    [Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]

BioPython sequences are also supported:

.. code:: python

    >>> # Note: Bio.Alphabet was removed in Biopython 1.78, so this example
    >>> # requires an earlier Biopython release.
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> sequence = Seq('''\
    GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
    TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
    CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
    GGGATAGG''', IUPAC.unambiguous_dna)
    >>> subsequence = Seq('TGCACTGTAGGGATAACAAT', IUPAC.unambiguous_dna)
    >>> find_near_matches(subsequence, sequence, max_l_dist=2)
    [Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]
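To make concrete what a "near match" is, the following brute-force sketch scans a sequence with the classic dynamic-programming recurrence for approximate substring matching. It is an illustration under simplifying assumptions, not the algorithm fuzzysearch uses: the function name ``near_match_ends`` is hypothetical, and it reports only where matches end and their distance, not where they start.

.. code:: python

    def near_match_ends(subsequence, sequence, max_l_dist):
        """Return (end, dist) pairs: `end` is an end index (exclusive) in
        `sequence` where some substring ending there is within `max_l_dist`
        edits of `subsequence`; `dist` is the smallest such distance."""
        m = len(subsequence)
        # column[i] = smallest edit distance between subsequence[:i] and any
        # substring of `sequence` ending at the current position.
        column = list(range(m + 1))
        results = []
        for end, char in enumerate(sequence, start=1):
            prev_diag = column[0]  # column[0] stays 0: a match may start anywhere
            for i in range(1, m + 1):
                cost = 0 if subsequence[i - 1] == char else 1
                new_value = min(
                    prev_diag + cost,   # match or substitution
                    column[i] + 1,      # extra character in the sequence
                    column[i - 1] + 1,  # skipped character of the sub-sequence
                )
                prev_diag = column[i]
                column[i] = new_value
            if column[m] <= max_l_dist:
                results.append((end, column[m]))
        return results

    print(near_match_ends('PATTERN', '---PATERN---', max_l_dist=1))
    # prints [(9, 1)]

The reported end index 9 and distance 1 agree with the ``Match(start=3, end=9, dist=1, ...)`` result shown in the Usage section. This scan takes time proportional to the product of the two lengths.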

Matching Criteria
-----------------

The search function supports four possible match criteria, which may be supplied in any combination:

* maximum Levenshtein distance (``max_l_dist``)
* maximum number of substitutions (``max_substitutions``)
* maximum number of deletions (``max_deletions``); "delete" = skip a character in the sub-sequence
* maximum number of insertions (``max_insertions``); "insert" = skip a character in the sequence

Not supplying a criterion means that there is no limit for it. For this reason, one must always supply ``max_l_dist`` and/or all other criteria.

.. code:: python

    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

    >>> # this will not match since max_deletions is set to zero
    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
    []

    >>> # note that a deletion + insertion may be combined to match a substitution;
    >>> # the Levenshtein distance is still 1
    >>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
    [Match(start=3, end=10, dist=1, matched="PAT-ERN")]

    >>> # ... but deletion + insertion may also match other, non-substitution differences
    >>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
    [Match(start=3, end=10, dist=2, matched="PATERRN")]
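The interplay between the separate limits can be explored with a small stand-alone checker. The sketch below is an assumption-laden illustration, not fuzzysearch's implementation: the hypothetical function ``matches_within`` compares the sub-sequence with one whole candidate string (it does not search inside a longer sequence) and ignores ``max_l_dist``. It uses the definitions given above: a deletion skips a character in the sub-sequence, an insertion skips a character in the candidate.

.. code:: python

    from functools import lru_cache

    def matches_within(subsequence, candidate,
                       max_substitutions, max_insertions, max_deletions):
        """Return True if `candidate` can be aligned with `subsequence`
        without exceeding any of the three per-operation limits."""

        @lru_cache(maxsize=None)
        def search(i, j, subs, ins, dels):
            # i, j: positions in subsequence and candidate;
            # subs, ins, dels: remaining budget for each operation.
            if i == len(subsequence) and j == len(candidate):
                return True
            if i < len(subsequence) and j < len(candidate):
                # Exact character match costs nothing.
                if subsequence[i] == candidate[j] and search(i + 1, j + 1, subs, ins, dels):
                    return True
                # Substitution: consume one character from each side.
                if subs > 0 and search(i + 1, j + 1, subs - 1, ins, dels):
                    return True
            # Deletion: skip a character in the sub-sequence.
            if i < len(subsequence) and dels > 0 and search(i + 1, j, subs, ins, dels - 1):
                return True
            # Insertion: skip a character in the candidate.
            if j < len(candidate) and ins > 0 and search(i, j + 1, subs, ins - 1, dels):
                return True
            return False

        return search(0, 0, max_substitutions, max_insertions, max_deletions)

    print(matches_within('PATTERN', 'PATERN', 1, 1, 0))   # False: a deletion is required
    print(matches_within('PATTERN', 'PAT-ERN', 0, 1, 1))  # True: deletion + insertion
    print(matches_within('PATTERN', 'PATERRN', 0, 1, 1))  # True: deletion + insertion
    print(matches_within('PATTERN', 'PATERRN', 1, 0, 0))  # False: one substitution is not enough

The last two calls show why a deletion plus an insertion is more permissive than a single substitution: ``'PATERRN'`` differs from ``'PATTERN'`` in two positions, so it cannot be reached with one substitution alone.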

When to Use Other Tools
-----------------------

Use case: Search through a list of strings for almost-exactly matching strings. For example, searching through a list of names for possible slight variations of a certain name.

Suggestion: Consider using `fuzzywuzzy <https://github.com/seatgeek/fuzzywuzzy>`_.

History
-------

0.7.3 (2020-06-27)
++++++++++++++++++

* Fixed segmentation faults due to wrong handling of inputs in bytes-like-only functions in C extensions.

0.7.2 (2020-05-07)
++++++++++++++++++

* Added PyPy support.
* Several minor bug fixes.

0.7.1 (2020-04-05)
++++++++++++++++++

* Dropped support for Python 3.4.
* Removed deprecation warning with Python 3.8.
* Fixed a couple of nasty bugs.

0.7.0 (2020-01-14)
++++++++++++++++++

* Added ``matched`` attribute to ``Match`` objects containing the matched part of the sequence.
* Added support for CPython 3.8. Now supporting CPython 2.7 and 3.4-3.8.

0.6.2 (2019-04-22)
++++++++++++++++++

* Fix calling ``search_exact()`` without passing ``end_index``.
* Fix edge case: max. dist >= sub-sequence length.

0.6.1 (2018-12-08)
++++++++++++++++++

* Fixed some C compiler warnings for the C and Cython modules

0.6.0 (2018-12-07)
++++++++++++++++++

* Dropped support for Python versions 2.6, 3.2 and 3.3
* Added support and testing for Python 3.7
* Optimized the n-grams Levenshtein search for long sub-sequences
* Further optimized the n-grams Levenshtein search
* Cython versions of the optimized parts of the n-grams Levenshtein search

0.5.0 (2017-09-05)
++++++++++++++++++

* Fixed ``search_exact_byteslike()`` to support supplying start and end indexes
* Added support for lists, tuples and other ``Sequence`` types to ``search_exact()``
* Fixed a bug where ``find_near_matches()`` could return a wrong ``Match.end`` with ``max_l_dist=0``
* Added more tests and improved some existing ones.

0.4.0 (2017-07-06)
++++++++++++++++++

* Added support and testing for Python 3.5 and 3.6
* Many small improvements to README, setup.py and CI testing

0.3.0 (2015-02-12)
++++++++++++++++++

* Added C extensions for several search functions as well as internal functions
* Use C extensions if available, or pure-Python implementations otherwise
* setup.py attempts to build C extensions, but installs without if build fails
* Added ``--noexts`` setup.py option to avoid trying to build the C extensions
* Greatly improved testing and coverage

0.2.2 (2014-03-27)
++++++++++++++++++

* Added support for searching through BioPython ``Seq`` objects
* Added specialized search function allowing only substitutions and insertions
* Fixed several bugs

0.2.1 (2014-03-14)
++++++++++++++++++

* Fixed major match grouping bug

0.2.0 (2013-03-13)
++++++++++++++++++

* New utility function ``find_near_matches()`` for easier use
* Additional documentation

0.1.0 (2013-11-12)
++++++++++++++++++

* Two working implementations
* Extensive test suite; all tests passing
* Full support for Python 2.6-2.7 and 3.1-3.3
* Bumped status from Pre-Alpha to Alpha

0.0.1 (2013-11-01)
++++++++++++++++++

* First release on PyPI.
