Lightweight piece tokenization library
This Python library provides word- and sentencepiece tokenizers. The following types of tokenizers are currently supported:
| Tokenizer | Binding | Example model |
|---|---|---|
| BPE | sentencepiece | |
| Byte BPE | Native | RoBERTa/GPT-2 |
| Unigram | sentencepiece | XLM-RoBERTa |
| Wordpiece | Native | BERT |
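To illustrate what a piece tokenizer does, here is a minimal, self-contained sketch of WordPiece-style greedy longest-match tokenization, the scheme BERT uses. This is a conceptual example only, with a hypothetical toy vocabulary; it is not the API of this library.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split a word into the longest matching vocabulary pieces, left to
    right. Non-initial pieces carry the '##' continuation prefix."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a
        # vocabulary entry matches.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary for illustration only.
vocab = {"un", "##believ", "##able", "token", "##izer"}
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("tokenizer", vocab))     # ['token', '##izer']
```

Real vocabularies contain tens of thousands of pieces learned from a corpus; the native Wordpiece binding in this library implements this matching efficiently.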
This package is experimental; its APIs are likely to change in incompatible ways.
Curated tokenizers is available through PyPI:

```bash
pip install curated-tokenizers
```
The best way to get started with curated tokenizers is through the
curated-transformers
library. curated-transformers also provides functionality to load tokenization
models from the Hugging Face Hub.