A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages

Catherine Arnett; Tyler A. Chang; Benjamin K. Bergen

TL;DR

To compare text dataset sizes across languages, this work defines the byte premium $BP_{A/C}$ as the ratio of UTF-8 bytes needed to encode content-matched text in language A relative to a reference language C; $BP_A$ denotes the case where C is English. $BP_A$ is estimated from multiple parallel corpora (e.g., NLLB, FLORES, Bible) and extended to novel languages via regression, using the decomposition $BP_A = (\text{Bytes}_A/\text{Chars}_A) \times (\text{Chars}_A/\text{Chars}_C) \times (\text{Chars}_C/\text{Bytes}_C)$. The study reports high cross-dataset consistency ($r > 0.90$) and finds that gzip compression preserves much of the relative variance while reducing absolute magnitudes. A Python tool is released to compute or predict the byte premium for any language pair, enabling more equitable data practices in multilingual model development. Overall, the framework provides a practical, data-backed method to normalize dataset size measurements across languages and highlights implications for storage, bandwidth, and tokenization in low-resource contexts.
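
The definition and its decomposition can be illustrated with a minimal Python sketch. This is not the released tool: the function names are hypothetical, "characters" are assumed to be Unicode code points (what Python's `len` counts on a `str`), and the two short strings stand in for a real content-matched corpus.

```python
import gzip


def utf8_len(text: str) -> int:
    """Number of bytes needed to encode the text as UTF-8."""
    return len(text.encode("utf-8"))


def byte_premium(text_a: str, text_c: str) -> float:
    """Byte ratio of content-matched text in language A to reference language C."""
    return utf8_len(text_a) / utf8_len(text_c)


def decomposed_byte_premium(text_a: str, text_c: str) -> float:
    """Same quantity as the product of the three factors in the decomposition."""
    bytes_per_char_a = utf8_len(text_a) / len(text_a)   # Bytes_A / Chars_A
    length_ratio = len(text_a) / len(text_c)            # Chars_A / Chars_C
    chars_per_byte_c = len(text_c) / utf8_len(text_c)   # Chars_C / Bytes_C
    return bytes_per_char_a * length_ratio * chars_per_byte_c


def gzip_byte_premium(text_a: str, text_c: str) -> float:
    """Byte ratio after gzip compression of each side's UTF-8 encoding."""
    compressed_a = gzip.compress(text_a.encode("utf-8"))
    compressed_c = gzip.compress(text_c.encode("utf-8"))
    return len(compressed_a) / len(compressed_c)


# Toy content-matched pair (illustrative only, not data from the paper).
english = "Hello, world!"   # 13 characters, 13 UTF-8 bytes
russian = "Привет, мир!"    # 12 characters, 21 UTF-8 bytes

print(byte_premium(russian, english))             # 21 / 13, about 1.62
print(decomposed_byte_premium(russian, english))  # same value up to rounding
```

The three factors telescope, so the product equals the direct byte ratio up to floating-point rounding. On strings this short, `gzip_byte_premium` is dominated by fixed header overhead, so it is only meaningful on corpus-sized inputs.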

Abstract

How should text dataset sizes be compared across languages? Even for content-matched (parallel) corpora, UTF-8 encoded text can require a dramatically different number of bytes for different languages. In our work, we define the byte premium between two languages as the ratio of bytes used to encode content-matched text in those languages. We compute byte premiums for 1155 languages, and we use linear regressions to estimate byte premiums for other languages. We release a tool to obtain byte premiums for any two languages, enabling comparisons of dataset sizes across languages for more equitable multilingual model development and data practices.
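
One way such byte premiums might be used to compare dataset sizes is to divide each dataset's byte count by its byte premium relative to English, giving an approximate "English-equivalent" size. The sketch below assumes this normalization; the function name is hypothetical and the byte premium values are made up for illustration, not taken from the paper or its tool.

```python
def english_equivalent_bytes(size_bytes: float, byte_premium: float) -> float:
    """Scale a dataset's byte count to the bytes the same content would take in English."""
    if byte_premium <= 0:
        raise ValueError("byte premium must be positive")
    return size_bytes / byte_premium


# Hypothetical byte premiums relative to English (illustrative values only).
hypothetical_premiums = {"lang_x": 2.0, "lang_y": 1.0}

# Two datasets with the same raw size of 1 GB.
raw_sizes = {"lang_x": 1e9, "lang_y": 1e9}

for lang, size in raw_sizes.items():
    scaled = english_equivalent_bytes(size, hypothetical_premiums[lang])
    print(lang, scaled)
```

Under these assumed values, the `lang_x` dataset carries about half as much content as the `lang_y` dataset despite having the same raw byte count.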

Paper Structure (28 sections, 3 equations, 1 figure, 4 tables)

Introduction
Related Work
Computing Byte Premiums
NLLB
Other Parallel Corpora
Byte Premiums After Compression
Predicting Novel Byte Premiums
Predicting Length Ratios
Language Family
Script and Script Type
Character Entropy
Evaluating Byte Premium Predictions
Introducing the Tool
Discussion and Conclusion
Measuring Dataset Sizes
...and 13 more sections

Figures (1)

Figure B.1: Byte premiums before and after compression by $\texttt{gzip}$. Each point is a language's byte premium relative to English.
