| Tokenizer | Vocab Size | Tokens | Token Count |
|---|---|---|---|
| IP.appify - umtoken: level 1 (space + case) | 64k* | dormenti | 2 |
| IP.appify - umtoken: level 2 (+ suffix rules) | 64k* | dorm+enti | 1 |
| IP.appify - umtoken: level 3 (+ morph ops) | 64k* | dorm+enti | 1 |
| OpenAI - GPT4o (o200k_base) | 200k | dormenti | 3 |
| Google - Gemini 1.5 Pro (gemini-1.5-pro-002) | 256k | dormenti | 2 |
| Meta - Llama4 (Maverick) | 200k | dormenti | 3 |
If the tokens contain unexpected characters or hexadecimal codes, this is not an error; it is simply how the respective tokenizer encodes non-ASCII characters.
umtoken is trained on the wikimedia/wikipedia dataset for the 8 most commonly spoken languages in the EU (de, en, es, fr, it, nl, pl, ro).
For more information on umtoken and an explanation of its levels, please visit us on GitHub.
* Here, 'k' denotes a factor of 1024, not 1000.
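
The GPT4o row can be checked locally with OpenAI's `tiktoken` library, which ships the `o200k_base` encoding named in the table. A minimal sketch (assuming `tiktoken` is installed; the printed token strings may look odd for non-ASCII input, as noted above):

```python
import tiktoken

# Load the GPT4o encoding listed in the table.
enc = tiktoken.get_encoding("o200k_base")

token_ids = enc.encode("dormenti")

# The individual token strings and the token count (3 in the table above).
print([enc.decode([t]) for t in token_ids])
print(len(token_ids))
```

The other tokenizers would need their own libraries or APIs to reproduce their rows; umtoken itself is documented on GitHub as mentioned above.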