ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset

DatologyAI: Aldo Gael Carranza, Kaleigh Mentzer, Ricardo Pio Monti, Alex Fang, Alvin Deng, Amro Abbas, Anshuman Suri, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Diego Kiner, Fan Pan, Haakon Mongstad, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Luke Merrick, Parth Doshi, Paul Burstein, Pratyush Maini, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt

TL;DR

Targeted, per-language data curation mitigates multilingual interference and enables compute-efficient multilingual scaling, and these benefits extend to frontier model scale.

Abstract

Multilinguality is a core capability for modern foundation models, yet training high-quality multilingual models remains challenging due to uneven data availability across languages. A further challenge is the performance interference that can arise from joint multilingual training, commonly referred to as the "curse of multilinguality". We study multilingual data curation across thirteen languages and find that many reported regressions are not inherent to multilingual scaling or to fundamental capacity limits, but instead stem from correctable deficiencies in data quality and composition. In controlled bilingual experiments, improving data quality for any single language benefits others: curating English improves non-English performance in 12 of 13 languages, while curating non-English yields reciprocal improvements in English. Bespoke per-language curation produces substantially larger within-language improvements. Extending these findings to large-scale general-purpose training mixtures, we show that curated multilingual allocations comprising under 8% of total tokens remain remarkably effective. We operationalize this approach within an effort that produced a 20T-token pretraining corpus derived entirely from public sources. Models with 3B and 8B parameters trained on a 1T-token random subset achieve competitive multilingual accuracy with 4-10x fewer training FLOPs than strong public baselines, establishing a new Pareto frontier in multilingual performance versus compute. Moreover, these benefits extend to frontier model scale: the 20T-token corpus served as part of the pretraining dataset for Trinity Large (400B/A13B), which exhibits strong multilingual performance relative to its training FLOPs. These results show that targeted, per-language data curation mitigates multilingual interference and enables compute-efficient multilingual scaling.
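
For intuition about the compute axis behind the "fewer training FLOPs" claim, the following is a minimal sketch that assumes the widely used rule of thumb for dense transformers, training compute ≈ 6 × parameters × tokens. This section does not state how the paper counts FLOPs, so the approximation and the helper names `approx_training_flops` and `compute_ratio` are assumptions for illustration only. The only inputs taken from the abstract are the 3B and 8B model sizes and the 1T-token subset.

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute for a dense transformer: C ~ 6 * N * D.

    Assumption: this is a generic approximation, not the paper's own accounting.
    """
    if n_params <= 0 or n_tokens <= 0:
        raise ValueError("n_params and n_tokens must be positive")
    return 6.0 * n_params * n_tokens


def compute_ratio(baseline_flops: float, model_flops: float) -> float:
    """How many times more compute the baseline used than the model."""
    if model_flops <= 0:
        raise ValueError("model_flops must be positive")
    return baseline_flops / model_flops


# Model sizes and token budget quoted in the abstract.
flops_3b = approx_training_flops(3e9, 1e12)  # about 1.8e22 FLOPs
flops_8b = approx_training_flops(8e9, 1e12)  # about 4.8e22 FLOPs

if __name__ == "__main__":
    print(f"3B model on 1T tokens: {flops_3b:.2e} FLOPs")
    print(f"8B model on 1T tokens: {flops_8b:.2e} FLOPs")
```

To compare against a baseline, pass that baseline's estimated FLOPs as the first argument of `compute_ratio`. Baseline parameter counts and token budgets are not given in this section, so none are filled in here.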

Paper Structure

This paper contains 24 sections, 1 equation, 9 figures, and 16 tables.

Figures (9)

  • Figure 1: A new compute-performance Pareto frontier for English and multilingual capabilities. We report error rate (log-scale; 1-accuracy, lower is better) as a function of training FLOPs (log-scale) across English (MMLU+ARC) and three multilingual benchmarks (Multilingual MMLU, Multilingual ARC, and Belebele). All evaluations use a multiple-choice format; multilingual scores are averaged over 13 languages. The shaded gray region summarizes the performance–compute envelope of representative open-weight baselines (e.g., Qwen3-4B/8B, Granite-4.0-3B). DatologyAI models occupy the bottom-left region relative to these baselines, indicating substantially lower multilingual error at reduced compute. We restrict our English-language evaluations to MMLU and ARC-Challenge for parity with the multilingual evaluations, and reserve comprehensive English and quantitative benchmarking for forthcoming companion releases.
  • Figure 2: Impact of Curation Strategy on Multilingual Performance (bilingual models). Performance comparison for 3B-parameter models trained on 60B tokens (50:50 English:non-English ratio). Results are averaged across multilingual MMLU, ARC, and Belebele. Improved English curation (light blue bars) improves performance over the uncurated baseline (dark purple bars) in 12 of 13 languages, while combining curated English with curated multilingual data (dark blue bars) yields the highest average scores across all languages.
  • Figure 3: Correlation between language similarity to English and cross-lingual transfer benefit. We evaluate linguistic distance using two proxies: (a) average log embedding distance across LaBSE, e5-small, and sentence-transformers, and (b) log perplexity of the target language under an English-only model. Both metrics show a significant negative correlation (Pearson $r=-0.62$ and $r=-0.70$ respectively) with the performance uplift gained from English data curation. These results demonstrate that linguistically similar languages, such as Spanish and French, receive the most pronounced benefits from high-quality English data, while more distant languages like Bengali and Arabic show significantly lower transfer gains.
  • Figure 4: Non-English Curation Benefits English Performance. Performance comparison for 3B-parameter models trained on 60B tokens (50:50 English:non-English ratio). Results are the average of English MMLU and ARC. We contrast performance when the accompanying multilingual data is uncurated (dark purple) versus curated (dark blue). We observe positive transfer in 12 out of 13 languages, with an overall relative improvement of 1.21%.
  • Figure 5: Evaluation of benefits associated with Random vs Scored Translation for Low-Resource Languages. Performance curves for Hindi, Bengali and Arabic showing that while augmenting training data with translated English text (red and cyan lines) improves over the uncurated baseline (dark gray), it still falls short of the performance achieved by bespoke DatologyAI curation (dark blue).
  • ...and 4 more figures
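
The Figure 3 caption reports Pearson correlations between a log-scaled distance proxy and the uplift from English curation. The following is a minimal sketch of that kind of computation. The function `pearson_r` is a plain textbook implementation, and the `distances` and `uplifts` values are synthetic placeholders chosen to be perfectly linear after the log transform; they are not data from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Synthetic placeholder values (NOT from the paper): one entry per language.
distances = [0.2, 0.4, 0.8, 1.6]  # hypothetical distance-to-English proxy
uplifts = [4.0, 3.0, 2.0, 1.0]    # hypothetical gain from English curation

# The caption correlates the *log* of the distance proxy with the uplift.
log_distances = [math.log(d) for d in distances]
r = pearson_r(log_distances, uplifts)

if __name__ == "__main__":
    print(f"Pearson r on placeholder data: {r:.3f}")
```

Because the placeholder data is linear by construction, `r` comes out at -1 up to floating-point error. Real per-language measurements would give a weaker correlation, such as the values reported in the caption.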