Project: curated-tokenizers

Lightweight piece tokenization library

Project Details

Latest version
0.9.1
Home Page
https://github.com/explosion/curated-tokenizers
PyPI Page
https://pypi.org/project/curated-tokenizers/

Project Popularity

PageRank
0.003129522865133755
Number of downloads
36847

🥢 Curated Tokenizers

This Python library provides word-/sentencepiece tokenizers. The following types of tokenizers are currenty supported:

Tokenizer Binding Example model
BPE sentencepiece
Byte BPE Native RoBERTa/GPT-2
Unigram sentencepiece XLM-RoBERTa
Wordpiece Native BERT

⚠️ Warning: experimental package

This package is experimental and it is likely that the APIs will change in incompatible ways.

⏳ Install

Curated tokenizers is availble through PyPI:

pip install curated_tokenizers

🚀 Quickstart

The best way to get started with curated tokenizers is through the curated-transformers library. curated-transformers also provides functionality to load tokenization models from Huggingface Hub.