A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages
Catherine Arnett, Tyler A. Chang, Benjamin K. Bergen
TL;DR
To make text dataset sizes comparable across languages, this work defines the byte premium $BP_{A/C}$ as the ratio of UTF-8 bytes needed to encode content-matched text in language A to the bytes needed for the same content in a reference language C. With English as the reference, this is abbreviated $BP_A$. $BP_A$ is estimated from multiple parallel corpora (e.g., NLLB, FLORES, Bible) and extended to novel languages via regression, using the decomposition $BP_A = (Bytes_A/Chars_A) \times (Chars_A/Chars_C) \times (Chars_C/Bytes_C)$. The study reports high cross-dataset consistency ($r>0.90$) and finds that gzip compression preserves much of the relative variance in byte premiums while reducing their absolute magnitudes. A Python tool is released to compute or predict the byte premium for any language pair, supporting more equitable data practices in multilingual model development. The framework thus offers a practical, data-backed way to normalize dataset size measurements across languages, with implications for storage, bandwidth, and tokenization in low-resource contexts.
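The definition and decomposition above can be checked on a toy example. The following is a minimal sketch, not the released tool: the function names are hypothetical, the two sentences are assumed to be content-matched, and Python's `len` on a string counts Unicode code points, which is only one possible definition of "characters".

```python
import gzip


def byte_premium(text_a: str, text_c: str) -> float:
    """UTF-8 bytes of text_a divided by UTF-8 bytes of text_c.

    Assumes the two texts are content-matched (parallel).
    """
    return len(text_a.encode("utf-8")) / len(text_c.encode("utf-8"))


def decomposed_byte_premium(text_a: str, text_c: str) -> float:
    """Same quantity, computed as the product of the three factors:
    (Bytes_A/Chars_A) * (Chars_A/Chars_C) * (Chars_C/Bytes_C).

    Characters are counted as code points here (an assumption).
    """
    bytes_a = len(text_a.encode("utf-8"))
    bytes_c = len(text_c.encode("utf-8"))
    chars_a = len(text_a)
    chars_c = len(text_c)
    return (bytes_a / chars_a) * (chars_a / chars_c) * (chars_c / bytes_c)


def gzip_byte_premium(text_a: str, text_c: str) -> float:
    """Byte ratio after gzip compression of each UTF-8 encoded text."""
    compressed_a = gzip.compress(text_a.encode("utf-8"))
    compressed_c = gzip.compress(text_c.encode("utf-8"))
    return len(compressed_a) / len(compressed_c)


# Toy parallel pair (illustrative only; real estimates use large corpora).
russian = "Привет, мир."   # 12 code points, 21 UTF-8 bytes
english = "Hello, world."  # 13 code points, 13 UTF-8 bytes

direct = byte_premium(russian, english)
product = decomposed_byte_premium(russian, english)
print(f"direct: {direct:.3f}, decomposed: {product:.3f}")
```

On strings this short the gzip ratio is dominated by fixed header overhead, so `gzip_byte_premium` is only meaningful on corpus-sized inputs.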
Abstract
How should text dataset sizes be compared across languages? Even for content-matched (parallel) corpora, UTF-8 encoded text can require a dramatically different number of bytes for different languages. In our work, we define the byte premium between two languages as the ratio of bytes used to encode content-matched text in those languages. We compute byte premiums for 1155 languages, and we use linear regressions to estimate byte premiums for other languages. We release a tool to obtain byte premiums for any two languages, enabling comparisons of dataset sizes across languages for more equitable multilingual model development and data practices.
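One way byte premiums could be used to compare dataset sizes is sketched below. This is an illustration under stated assumptions, not the API of the released tool: the language codes `xxa` and `xxb` are placeholders, their byte premium values are made up, and the helper names are hypothetical. The pairwise ratio follows from the definition, since for content-matched text $Bytes_A/Bytes_B = (Bytes_A/Bytes_C)/(Bytes_B/Bytes_C)$.

```python
# Hypothetical byte premiums relative to English (values are made up
# for illustration and are NOT taken from the paper).
BYTE_PREMIUMS = {
    "eng": 1.0,
    "xxa": 1.5,
    "xxb": 3.0,
}


def pairwise_byte_premium(lang_a: str, lang_b: str) -> float:
    """Byte premium of lang_a relative to lang_b, derived from the
    English-relative values: BP_{A/B} = BP_{A/C} / BP_{B/C}."""
    return BYTE_PREMIUMS[lang_a] / BYTE_PREMIUMS[lang_b]


def english_equivalent_bytes(size_bytes: float, lang: str) -> float:
    """Scale a dataset size to the bytes that content-matched English
    text would be expected to occupy."""
    return size_bytes / BYTE_PREMIUMS[lang]


# Two datasets with the same raw size can hold different amounts of content.
for lang in ("eng", "xxa", "xxb"):
    scaled = english_equivalent_bytes(3_000_000, lang)
    print(f"{lang}: 3,000,000 raw bytes ~ {scaled:,.0f} English-equivalent bytes")
```

Under these made-up values, a 3 MB dataset in `xxb` would carry roughly as much content as 1 MB of English text, which is the kind of disparity the byte premium is meant to expose.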
