
The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models

Aradhya Dixit, Shreem Dixit

TL;DR

The paper introduces the script tax, a tokenization-driven, script-dependent disparity in throughput and information efficiency for multilingual language models. It uses a paired-orthography evaluation to quantify tokenization fragmentation via fertility, information cost via bits per character (BPC), and compute cost via latency, revealing a ~3.4× fertility gap and a ~16.5× slowdown on identical hardware, with BPC rising by up to ~47% for XLM-R. The authors advocate script-aware tokenization and pretraining, arguing that tokenization choices encode systematic inequities that are not captured by token-level loss alone. They also provide robustness checks (round-trip CER) and emphasize reporting compute-aware metrics to better assess multilingual NLP performance and fairness.

Abstract

Pretrained multilingual language models are often assumed to be script-agnostic, yet their tokenizers can impose systematic costs on certain writing systems. We quantify this script tax by comparing two orthographic variants with identical linguistic content. Across mBERT and XLM-R, the higher-fragmentation orthography shows a ~3.4× increase in fertility (6.73-6.85 vs. 2.10-2.35 tokens/word), leading to a 16.5× inference slowdown (0.23 vs. 3.8 sentences/second) on identical hardware. Using bits per character (BPC) to avoid the "NLL paradox" from subword fragmentation, we find a substantial increase in information cost: +19.7% for mBERT (8.06 → 9.65) and +47.1% for XLM-R (12.19 → 17.94). A round-trip conversion check (CER_rt = 0.31) suggests these gaps reflect orthography-conditioned processing rather than mapping noise. Our results highlight tokenization as a key source of inequity in multilingual NLP and motivate script-aware tokenization and pretraining.
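
The two tokenization-side metrics named in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes fertility is counted over whitespace-delimited words and that BPC is the total negative log-likelihood (in nats) converted to bits and divided by the character count. The `toy_tokenize` function and all numbers in the example are hypothetical stand-ins for a real subword tokenizer and real model losses.

```python
import math
from typing import Callable, Iterable, List


def fertility(sentences: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = 0
    n_words = 0
    for sentence in sentences:
        n_tokens += len(tokenize(sentence))
        n_words += len(sentence.split())
    if n_words == 0:
        raise ValueError("no words to measure")
    return n_tokens / n_words


def bits_per_character(total_nll_nats: float, n_chars: int) -> float:
    """Total loss converted from nats to bits, normalized by character count."""
    if n_chars <= 0:
        raise ValueError("n_chars must be positive")
    return total_nll_nats / (math.log(2) * n_chars)


def toy_tokenize(text: str, piece_len: int = 2) -> List[str]:
    """Hypothetical tokenizer: cuts every word into fixed-length pieces."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces


# Made-up numbers for one 20-character sentence under two tokenizations.
# The fragmented variant has a lower loss per token but a higher loss per
# character, which is why a per-token comparison can mislead.
n_chars = 20
coarse_nll, coarse_tokens = 30.0, 5
fragmented_nll, fragmented_tokens = 48.0, 16

coarse_per_token = coarse_nll / coarse_tokens
fragmented_per_token = fragmented_nll / fragmented_tokens
coarse_bpc = bits_per_character(coarse_nll, n_chars)
fragmented_bpc = bits_per_character(fragmented_nll, n_chars)
```

Normalizing by characters keeps the denominator fixed across orthographies of the same content, so a tokenizer cannot appear to help simply by emitting more, easier tokens.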

Paper Structure (19 sections, 5 equations, 3 figures)


Figures (3)

  • Figure 1: Evaluation pipeline used to measure the script tax. We compare paired sentences across orthographic variants and compute (i) tokenization fertility (tokens/word), (ii) modeling efficiency via BPC (loss normalized by character count), and (iii) inference latency/throughput on identical hardware.
  • Figure 2: Tokenization bottleneck (fertility, tokens/word). The higher-fragmentation orthography requires substantially more tokens per word across both mBERT and XLM-R.
  • Figure 3: The script tax as a joint disparity in information cost and runtime. The higher-fragmentation orthography exhibits both higher BPC (lower is better) and substantially higher inference latency.
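
As a quick sanity check, the headline ratios can be recomputed from the rounded values quoted in the abstract. This is only arithmetic on those published numbers; the helper names are illustrative. The slowdown and BPC increases match the stated figures to within rounding. For fertility, the abstract gives ranges without saying which endpoint belongs to which model, so the sketch only brackets the ratio; the endpoint ratios land somewhat below the quoted ~3.4×, which presumably rests on unrounded or per-sentence values not reproduced here.

```python
def pct_increase(old: float, new: float) -> float:
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


# Throughput in sentences/second, as quoted in the abstract.
slowdown = 3.8 / 0.23

# BPC before -> after, as quoted in the abstract.
mbert_bpc_increase = pct_increase(8.06, 9.65)
xlmr_bpc_increase = pct_increase(12.19, 17.94)

# Fertility ranges (tokens/word): bracket the ratio using range endpoints,
# since the model-to-endpoint pairing is not stated in the abstract.
fertility_ratio_low = 6.73 / 2.35
fertility_ratio_high = 6.85 / 2.10

print(f"slowdown: {slowdown:.2f}x")
print(f"mBERT BPC increase: {mbert_bpc_increase:.2f}%")
print(f"XLM-R BPC increase: {xlmr_bpc_increase:.2f}%")
print(f"fertility ratio bounds: {fertility_ratio_low:.2f}x to {fertility_ratio_high:.2f}x")
```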