Tokenization Bench

Tokenized Results

Tokenizer                             Vocab Size   Tokens          Token Count
IP.appify - umtoken: space + case     64k*         astrophysics    2
IP.appify - umtoken: + suffix rules   64k*         astrophysic+s   2
IP.appify - umtoken: + morph ops      64k*         astrophysic+s   2
OpenAI - GPT-4o                       200k         astrophysics    3
Google - Gemini 1.5 Pro               256k         astrophysics    2
Meta - Llama 4                        200k         astrophysics    3
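The suffix-rule rows above can be illustrated with a toy sketch (this is not umtoken's actual algorithm; the stem vocabulary and suffix set here are invented for illustration). Splitting off a productive suffix lets one stem entry such as "astrophysic" be reused across inflected forms instead of storing each form whole:

```python
# Toy suffix-rule tokenizer (hypothetical sketch, NOT umtoken's real code).
VOCAB = {"astrophysic"}        # assumed stem vocabulary
SUFFIXES = {"s", "es", "ing"}  # assumed productive suffix rules

def tokenize(word):
    """Return a stem+suffix split if one matches, else the whole word."""
    # Try longer suffixes first so "es" wins over "s" where both match.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and stem in VOCAB:
            return [stem, "+" + suffix]
    return [word]

print(tokenize("astrophysics"))  # ['astrophysic', '+s'] -> 2 tokens
```

This matches the table's "astrophysic+s" rendering: two tokens, one of which is a shared stem.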

If the tokens contain unexpected characters or hexadecimal codes, this is not an error: it is simply how the respective tokenizer encodes non-ASCII characters.
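The hexadecimal codes arise because byte-level tokenizers operate on UTF-8 bytes rather than characters, so a single non-ASCII character spans several bytes that may be displayed as hex escapes. A short stdlib-only illustration:

```python
# A non-ASCII character occupies multiple UTF-8 bytes, which some
# tokenizers display as hex codes instead of the character itself.
word = "naïve"
raw = word.encode("utf-8")
print(list(raw))     # 'ï' (U+00EF) becomes two bytes: 0xC3 0xAF
print(raw.hex(" "))  # 6e 61 c3 af 76 65
```

So a token boundary can even fall in the middle of a character's byte sequence, which is why the display falls back to raw hex.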

umtoken is trained on the wikimedia/wikipedia dataset for the 8 most commonly spoken languages in the EU (de, en, es, fr, it, nl, pl, ro).
For more information on umtoken and an explanation of its levels, please visit us on GitHub.

* Here, 'k' denotes a factor of 1024, not 1000, so 64k = 65,536 vocabulary entries.