| Tokenizer | Vocab Size | Tokens | Token Count |
|---|---|---|---|
| IP.appify - umtoken: level 1 (space + case) | 64k* | dormenti | 2 |
| IP.appify - umtoken: level 2 (+ suffix rules) | 64k* | dorm+enti | 1 |
| IP.appify - umtoken: level 3 (+ morph ops) | 64k* | dorm+enti | 1 |
| OpenAI - GPT4o (o200k_base) | 200k | dormenti | 3 |
| Google - Gemini 1.5 Pro (gemini-1.5-pro-002) | 256k | dormenti | 2 |
| Meta - Llama4 (Maverick) | 200k | dormenti | 3 |
If the tokens contain unexpected characters or hexadecimal codes, this is not an error; it is simply how the respective tokenizer encodes non-ASCII characters.
umtoken is trained on the wikimedia/wikipedia dataset for the 8 most commonly spoken languages in the EU (de, en, es, fr, it, nl, pl, ro).
For more information on umtoken and an explanation of its levels, please visit us on GitHub.
* Here, 'k' denotes a factor of 1024, not 1000.
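
The GPT4o row can be checked locally with OpenAI's `tiktoken` library, which ships the `o200k_base` encoding named in the table. A minimal sketch (assuming `tiktoken` is installed; the printed token strings may look odd for non-ASCII input, as noted above):

```python
import tiktoken

# Load the GPT4o encoding listed in the table.
enc = tiktoken.get_encoding("o200k_base")

token_ids = enc.encode("dormenti")

# The individual token strings and the token count (3 in the table above).
print([enc.decode([t]) for t in token_ids])
print(len(token_ids))
```

The other tokenizers would need their own libraries or APIs to reproduce their rows; umtoken itself is documented on GitHub as mentioned above.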