Tokenization Bench

Random word: dormenti

Tokenized Results

Tokenizer                            Vocab Size  Tokens     Token Count
IP.appify - umtoken: space + case    64k*        dormenti   2
IP.appify - umtoken: + suffix rules  64k*        dorm+enti  1
IP.appify - umtoken: + morph ops     64k*        dorm+enti  1
OpenAI - GPT-4o                      200k        dormenti   3
Google - Gemini 1.5 Pro              256k        dormenti   2
Meta - Llama 4                       200k        dormenti   3

If the tokens contain unexpected characters or hexadecimal codes, this is not an error; it is simply how the respective tokenizer encodes non-ASCII characters.
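As a point of comparison, the GPT-4o row can be checked against OpenAI's open-source tiktoken library, which exposes the o200k_base encoding used by GPT-4o. This is a minimal sketch, not part of the bench itself; the exact split for "dormenti" may change if the encoding is ever revised.

    import tiktoken

    # o200k_base is the public encoding used by GPT-4o.
    enc = tiktoken.get_encoding("o200k_base")

    word = "dormenti"
    token_ids = enc.encode(word)
    print(len(token_ids))  # token count, as reported in the table above

    # Tokens are byte sequences, not strings; non-ASCII characters can
    # fall across token boundaries, which is why some tokenizers display
    # hexadecimal codes instead of readable text.
    for tid in token_ids:
        print(tid, enc.decode_single_token_bytes(tid))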

umtoken is trained on the wikimedia/wikipedia dataset for the 8 most commonly spoken languages in the EU (de, en, es, fr, it, nl, pl, ro).
For more information on umtoken and an explanation of its levels, please visit us on GitHub.
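For reference, the training corpus named above can be pulled with the Hugging Face datasets library. A minimal sketch, assuming the 20231101 Wikipedia dump (the dump date is an assumption; substitute whichever snapshot you need, and note that the full dumps are large):

    from datasets import load_dataset

    # The eight EU languages umtoken is trained on.
    languages = ["de", "en", "es", "fr", "it", "nl", "pl", "ro"]

    # "20231101" is an assumed snapshot date, not prescribed by umtoken.
    corpora = {
        lang: load_dataset("wikimedia/wikipedia", f"20231101.{lang}", split="train")
        for lang in languages
    }
    print({lang: len(ds) for lang, ds in corpora.items()})  # articles per language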

* Here, 'k' denotes a factor of 1024, not 1000, so 64k corresponds to 65,536 vocabulary entries.