Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation
Mir Tafseer Nayeem, Sawsan Alqahtani, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari
TL;DR
This paper addresses the limited diagnostic power of fertility for multilingual tokenization by introducing STRR, a type-level metric that measures the proportion of words preserved as single tokens: $\mathrm{STRR}(T;W)=\frac{1}{n}\sum_{i=1}^n \mathbbm{1}\!\left(|T(w_i)|=1\right)\times 100$. It evaluates six LLM tokenizers across seven languages and two domains, comparing STRR with fertility and related metrics. The results show strong English and Chinese support and pronounced fragmentation for Hindi, revealing cross-lingual allocation biases that fertility alone does not capture. The paper also offers actionable recommendations, including Pareto-based prioritization of core vocabulary and an end-to-end vocabulary-expansion pipeline, with publicly released word lists and code, to guide the design of more equitable and efficient multilingual tokenizers.
Abstract
Tokenization is a crucial but under-evaluated step in large language models (LLMs). The standard metric, fertility (the average number of tokens per word), captures compression efficiency but obscures how vocabularies are allocated across languages and domains. We analyze six widely used tokenizers across seven languages and two domains, finding stable fertility for English, high fertility for Chinese, and little domain sensitivity. To address fertility's blind spots, we propose the Single Token Retention Rate (STRR), which measures the proportion of words preserved as single tokens. STRR reveals systematic prioritization of English, strong support for Chinese, and fragmentation in Hindi, offering an interpretable view of cross-lingual fairness. Our results show that STRR complements fertility and provides practical guidance for designing more equitable multilingual tokenizers.
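To make the two metrics concrete, the following is a minimal sketch of how fertility and STRR could be computed for a list of words. It is not the paper's released code. The toy vocabulary, the greedy longest-match `toy_tokenize` function, and the choice to deduplicate words before computing STRR (to reflect its type-level definition) are all illustrative assumptions; in practice `tokenize` would wrap a real LLM tokenizer.

```python
from typing import Callable, Sequence

# Hypothetical toy vocabulary, used only for illustration.
TOY_VOCAB = {"the", "cat", "token", "ization"}


def toy_tokenize(word: str) -> list[str]:
    """Greedy longest-match tokenizer with a single-character fallback.

    This is an assumed stand-in for a real subword tokenizer.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest substring starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in TOY_VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: emit one character.
            tokens.append(word[i])
            i += 1
    return tokens


def fertility(tokenize: Callable[[str], Sequence[str]], words: Sequence[str]) -> float:
    """Average number of tokens per word."""
    if not words:
        raise ValueError("words must be non-empty")
    return sum(len(tokenize(w)) for w in words) / len(words)


def strr(tokenize: Callable[[str], Sequence[str]], words: Sequence[str]) -> float:
    """Percentage of word types that are kept as exactly one token."""
    # Assumption: deduplicate so each word type counts once.
    types = list(dict.fromkeys(words))
    if not types:
        raise ValueError("words must be non-empty")
    single = sum(1 for w in types if len(tokenize(w)) == 1)
    return 100.0 * single / len(types)


words = ["the", "cat", "tokenization", "xyz"]
print(fertility(toy_tokenize, words))  # 1.75
print(strr(toy_tokenize, words))       # 50.0
```

In this toy case, "the" and "cat" are single tokens, "tokenization" splits into two, and "xyz" falls back to three characters. Fertility averages these counts, so it does not show on its own how many words survived intact, whereas STRR reports that share directly.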
