Table of Contents
Fetching ...

Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation

Mir Tafseer Nayeem, Sawsan Alqahtani, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari

TL;DR

This paper tackles the limited diagnostic power of fertility for multilingual tokenization by introducing STRR, a type-level metric that measures the proportion of words preserved as single tokens: $\mathrm{STRR}(T;W)=\frac{1}{n}\sum_{i=1}^n \mathbbm{1}\!\left(|T(w_i)|=1\right)\times 100$. It conducts a cross-linguistic evaluation of six LLM tokenizers across seven languages and two domains, comparing STRR with fertility and related metrics. Key findings show strong English and Chinese support and pronounced fragmentation for Hindi, revealing cross-lingual allocation biases not captured by fertility alone. The paper also offers actionable recommendations, including core vocabulary Pareto-based prioritization and an end-to-end vocabulary-expansion pipeline with public word lists and release of code, to guide more equitable and efficient multilingual tokenizers.

Abstract

Tokenization is a crucial but under-evaluated step in large language models (LLMs). The standard metric, fertility (the average number of tokens per word), captures compression efficiency but obscures how vocabularies are allocated across languages and domains. We analyze six widely used tokenizers across seven languages and two domains, finding stable fertility for English, high fertility for Chinese, and little domain sensitivity. To address fertility's blind spots, we propose the Single Token Retention Rate (STRR), which measures the proportion of words preserved as single tokens. STRR reveals systematic prioritization of English, strong support for Chinese, and fragmentation in Hindi, offering an interpretable view of cross-lingual fairness. Our results show that STRR complements fertility and provides practical guidance for designing more equitable multilingual tokenizers.

Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation

TL;DR

This paper tackles the limited diagnostic power of fertility for multilingual tokenization by introducing STRR, a type-level metric that measures the proportion of words preserved as single tokens: . It conducts a cross-linguistic evaluation of six LLM tokenizers across seven languages and two domains, comparing STRR with fertility and related metrics. Key findings show strong English and Chinese support and pronounced fragmentation for Hindi, revealing cross-lingual allocation biases not captured by fertility alone. The paper also offers actionable recommendations, including core vocabulary Pareto-based prioritization and an end-to-end vocabulary-expansion pipeline with public word lists and release of code, to guide more equitable and efficient multilingual tokenizers.

Abstract

Tokenization is a crucial but under-evaluated step in large language models (LLMs). The standard metric, fertility (the average number of tokens per word), captures compression efficiency but obscures how vocabularies are allocated across languages and domains. We analyze six widely used tokenizers across seven languages and two domains, finding stable fertility for English, high fertility for Chinese, and little domain sensitivity. To address fertility's blind spots, we propose the Single Token Retention Rate (STRR), which measures the proportion of words preserved as single tokens. STRR reveals systematic prioritization of English, strong support for Chinese, and fragmentation in Hindi, offering an interpretable view of cross-lingual fairness. Our results show that STRR complements fertility and provides practical guidance for designing more equitable multilingual tokenizers.

Paper Structure

This paper contains 14 sections, 1 equation, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Single Token Retention Rate (STRR) across six LLM tokenizers for different language pairs. Each pair (e.g., English-French) represents 1,000 parallel words in both languages, allowing us to examine whether LLM tokenizers prioritize the English versions of words over their multilingual counterparts. A high STRR for English suggests that tokenizers allocate more vocabulary space to English, while differences in STRR across languages indicate varying degrees of support.