Lightweight piece tokenization library
This Python library provides word- and sentencepiece tokenizers. The following types of tokenizers are currently supported:
| Tokenizer | Binding | Example model |
|---|---|---|
| BPE | sentencepiece | |
| Byte BPE | Native | RoBERTa/GPT-2 |
| Unigram | sentencepiece | XLM-RoBERTa |
| Wordpiece | Native | BERT |
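To illustrate what a piece tokenizer does, here is a minimal, self-contained sketch of WordPiece-style greedy longest-match tokenization, the scheme BERT uses. This is a conceptual example only, with a hypothetical toy vocabulary; it is not the API of this library.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split a word into the longest matching vocabulary pieces, left to
    right. Non-initial pieces carry the '##' continuation prefix."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a
        # vocabulary entry matches.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary for illustration only.
vocab = {"un", "##believ", "##able", "token", "##izer"}
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("tokenizer", vocab))     # ['token', '##izer']
```

Real vocabularies contain tens of thousands of pieces learned from a corpus; the native Wordpiece binding in this library implements this matching efficiently.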
This package is experimental; its APIs are likely to change in incompatible ways.
Curated tokenizers is available through PyPI:

```bash
pip install curated-tokenizers
```
The best way to get started with curated tokenizers is through the
curated-transformers
library. curated-transformers also provides functionality to load tokenization
models from the Hugging Face Hub.