Project: confusable-homoglyphs

Detect confusable usage of unicode homoglyphs, prevent homograph attacks.

Project Details

Latest version
3.2.0
Home Page
https://github.com/vhf/confusable_homoglyphs
PyPI Page
https://pypi.org/project/confusable-homoglyphs/

Project Popularity

PageRank
0.0021443384914200288
Number of downloads
58611

confusable_homoglyphs `[doc] <http://confusable-homoglyphs.readthedocs.io/en/latest/>`__

A homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar (`wikipedia:Homoglyph <https://en.wikipedia.org/wiki/Homoglyph>`__).

Unicode homoglyphs can be a nuisance on the web. Your most popular client, AlaskaJazz, might be upset to be impersonated by a trickster who deliberately chose the username ΑlaskaJazz.

  • AlaskaJazz is single-script: only Latin characters.
  • ΑlaskaJazz is mixed-script: the first character is a Greek letter.
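The single-script versus mixed-script distinction above can be illustrated with the standard library alone. This is a minimal sketch, not this library's implementation: it assumes that the first word of a character's Unicode name (``LATIN``, ``GREEK``, ``CYRILLIC``, ...) is a good enough stand-in for its script, and the helper names ``script_of``, ``scripts_in`` and ``looks_mixed_script`` are hypothetical.

```python
import unicodedata


def script_of(char):
    """Rough script guess (assumption): first word of the Unicode character name."""
    return unicodedata.name(char, "UNKNOWN").split()[0]


def scripts_in(string):
    """Set of script guesses for the alphabetic characters of a string."""
    return {script_of(ch) for ch in string if ch.isalpha()}


def looks_mixed_script(string):
    """True when the letters of the string come from more than one script."""
    return len(scripts_in(string)) > 1


genuine = "AlaskaJazz"
impostor = "\u0391laskaJazz"  # starts with GREEK CAPITAL LETTER ALPHA

print(sorted(scripts_in(genuine)), looks_mixed_script(genuine))
print(sorted(scripts_in(impostor)), looks_mixed_script(impostor))
```

The name-based guess is only an approximation; the library relies on the Unicode Consortium's script data instead.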

You might also want to avoid people being tricked into entering their password on www.microsоft.com or www.faϲebook.com instead of www.microsoft.com or www.facebook.com. Here is a `utility <http://unicode.org/cldr/utility/confusables.jsp>`__ to play with these confusable homoglyphs.

Not all mixed-script strings have to be ruled out, though. You could exclude only those mixed-script strings that contain characters which might be confused with a character from the Unicode blocks of your choosing.

  • Allo and ρττ are fine: single script.
  • AlloΓ is fine when our preferred script alias is 'latin': mixed script, but Γ is not confusable.
  • Alloρ is dangerous: mixed script and ρ could be confused with p.
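The rule behind these three cases can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: ``LOOKALIKES`` is a tiny hand-picked table (the real data comes from the Unicode Consortium's confusables file), scripts are guessed from the first word of each character's Unicode name, and ``looks_dangerous`` is a hypothetical helper.

```python
import unicodedata

# Hypothetical, hand-picked excerpt of look-alike pairs (not the real data set).
LOOKALIKES = {
    "\u03c1": "p",  # GREEK SMALL LETTER RHO
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u03f2": "c",  # GREEK LUNATE SIGMA SYMBOL
}


def script_of(char):
    """Rough script guess (assumption): first word of the Unicode character name."""
    return unicodedata.name(char, "UNKNOWN").split()[0]


def looks_dangerous(string, preferred="LATIN"):
    """Mixed-script string with a foreign letter that mimics a preferred-script one."""
    letters = [ch for ch in string if ch.isalpha()]
    if len({script_of(ch) for ch in letters}) < 2:
        return False  # single script: accepted as-is
    return any(
        script_of(ch) != preferred
        and ch in LOOKALIKES
        and script_of(LOOKALIKES[ch]) == preferred
        for ch in letters
    )


# Allo, rho-tau-tau, Allo + capital gamma, Allo + rho (escaped to keep output ASCII).
for name in ["Allo", "\u03c1\u03c4\u03c4", "Allo\u0393", "Allo\u03c1"]:
    print(ascii(name), looks_dangerous(name))
```

Checking for mixed scripts first keeps single-script names such as ρττ acceptable even though ρ has a Latin look-alike.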

This library is compatible with Python 2 and Python 3.

`API documentation <http://confusable-homoglyphs.readthedocs.io/en/latest/apidocumentation.html>`__

Is the data up to date?

Yep.

The Unicode block aliases and names for each character are extracted from `this file <http://www.unicode.org/Public/UNIDATA/Scripts.txt>`__ provided by the Unicode Consortium.
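To give an idea of what extracting data from that file involves, here is a minimal sketch of a parser for its line layout (``first..last ; Script # comment``, or a single code point instead of a range). The sample lines are illustrative and written in that layout; ``parse_scripts`` and ``script_of`` are hypothetical helpers, not the functions this library uses.

```python
# Illustrative lines in the Scripts.txt layout (assumed excerpt, not the full file).
SAMPLE = """\
# Comment lines and blank lines are skipped.
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR

0391..03A1    ; Greek # L&  [17] GREEK CAPITAL LETTER ALPHA..GREEK CAPITAL LETTER RHO
"""


def parse_scripts(text):
    """Return a list of (first_code_point, last_code_point, script) tuples."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code_points, script = [field.strip() for field in data.split(";")]
        first, _, last = code_points.partition("..")
        # A single code point is treated as a one-element range.
        ranges.append((int(first, 16), int(last or first, 16), script))
    return ranges


def script_of(char, ranges):
    """Look up the script of a character, or None if no range covers it."""
    code_point = ord(char)
    for first, last, script in ranges:
        if first <= code_point <= last:
            return script
    return None


RANGES = parse_scripts(SAMPLE)
print(script_of("A", RANGES), script_of("\u0391", RANGES), script_of("a", RANGES))
```

A linear scan is fine for a sketch; with the full file, a sorted list searched with ``bisect`` would be a more sensible choice.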

The matrix of which characters can be confused with which other characters is built using `this file <http://www.unicode.org/Public/security/latest/confusables.txt>`__ provided by the Unicode Consortium.

This data is stored in two JSON files: categories.json and confusables.json. If you delete them, both are recreated by downloading and parsing the two files mentioned above, and the result is stored as JSON again.
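The parse-then-store-as-JSON step can be sketched for the confusables data as follows. The sample lines are illustrative and follow the ``source ; target ; type # comment`` layout of that file. The resulting JSON layout is an assumption made for this example and is not claimed to match the confusables.json shipped with the library; ``parse_confusables`` is a hypothetical helper.

```python
import json

# Illustrative lines in the confusables.txt layout (assumed excerpt, not the full file).
SAMPLE = """\
# Comment lines and blank lines are skipped.
0441 ; 0063 ; MA # CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C
03C1 ; 0070 ; MA # GREEK SMALL LETTER RHO -> LATIN SMALL LETTER P

043E ; 006F ; MA # CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
"""


def decode(hex_sequence):
    """Turn space-separated hexadecimal code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in hex_sequence.split())


def parse_confusables(text):
    """Map each source string to the list of strings it can be confused with."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        source, target, _kind = [field.strip() for field in data.split(";")]
        table.setdefault(decode(source), []).append(decode(target))
    return table


TABLE = parse_confusables(SAMPLE)
# ensure_ascii (the default) keeps the serialized form pure ASCII.
SERIALIZED = json.dumps(TABLE, sort_keys=True)
print(SERIALIZED)
```

Reading the string back with ``json.loads`` returns the same mapping, which is the property a cache of this kind depends on.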

History

1.0.0

Initial release.

2.0.0

  • allowed_categories renamed to allowed_aliases

2.0.1

  • Fix a TypeError: https://github.com/vhf/confusable_homoglyphs/pull/2

3.0.0

Courtesy of Ryan P Kilby, via https://github.com/vhf/confusable_homoglyphs/pull/6 :

  • Changed file paths to be relative to the confusable_homoglyphs package directory instead of the user's current working directory.
  • Data files are now distributed with the package.
  • Fixes tests so that they use the installed distribution instead of the local files. (Originally, the data files were erroneously showing up during testing, despite not being included in the distribution.)
  • Moves the data file generation into a simple CLI. This way, users have a method for controlling when the data files are updated.
  • Since the data files are now included in the distribution, the CLI is made optional. Its dependencies can be installed with the cli bundle, e.g. pip install confusable_homoglyphs[cli].