Lightweight piece tokenization library
This Python library provides word-/sentencepiece tokenizers. The following types of tokenizers are currently supported:
| Tokenizer | Binding       | Example model |
| --------- | ------------- | ------------- |
| BPE       | sentencepiece |               |
| Byte BPE  | Native        | RoBERTa/GPT-2 |
| Unigram   | sentencepiece | XLM-RoBERTa   |
| Wordpiece | Native        | BERT          |
This package is experimental; it is likely that its APIs will change in incompatible ways.
Curated Tokenizers is available on PyPI:

```bash
pip install curated_tokenizers
```
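As a quick illustration of direct use, the sketch below loads a SentencePiece model through the library's sentencepiece binding. The class name `SentencePieceProcessor`, the `from_file` constructor, and the shape of `encode`'s return value are assumptions based on typical piece-tokenizer APIs; check the package's documentation for the exact interface.

```python
# A minimal sketch, assuming curated_tokenizers exposes a
# SentencePieceProcessor with a from_file() constructor and an
# encode() method; verify the names against the package docs.
from curated_tokenizers import SentencePieceProcessor

# Load a trained SentencePiece model, e.g. the .model file that
# ships with an XLM-RoBERTa checkpoint.
processor = SentencePieceProcessor.from_file("sentencepiece.model")

# Encode a sentence into piece IDs and the corresponding pieces.
ids, pieces = processor.encode("Lightweight piece tokenization")
print(ids)
print(pieces)
```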
The best way to get started with Curated Tokenizers is through the curated-transformers library. curated-transformers also provides functionality to load tokenization models from the Hugging Face Hub.
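For example, a tokenizer could be loaded from the Hugging Face Hub along these lines. This is a sketch assuming curated-transformers' `AutoTokenizer.from_hf_hub` entry point; the model name is only illustrative.

```python
# A minimal sketch, assuming curated-transformers provides an
# AutoTokenizer with a from_hf_hub() constructor; the model name
# below is only an example.
from curated_transformers.tokenizers import AutoTokenizer

tokenizer = AutoTokenizer.from_hf_hub(name="bert-base-uncased")

# Tokenize a batch of texts into piece IDs.
pieces = tokenizer(["Lightweight piece tokenization"])
print(pieces.ids)
```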