Training Bilingual LMs with Data Constraints in the Targeted Language

Skyler Seto, Maartje ter Hoeve, Richard He Bai, Natalie Schluter, David Grangier

TL;DR

This work tackles pretraining language models for data-constrained target languages by leveraging abundant high-quality English data as an auxiliary source. It systematically analyzes data selection, filtering, topic-focused upsampling, synthetic data generation, and translation-based augmentation to understand when and why auxiliary data helps and how language distance affects transfer. Across German and multiple other languages, the study finds that higher-quality English data can boost target-language performance, with gains driven largely by information present in the auxiliary data rather than merely data quality; improvements are inconsistent across languages, suggesting stronger similarity between languages yields larger benefits. The results reveal practical scaling limits: target data must scale with model size, and the advantages of auxiliary data diminish when target data remains severely limited, highlighting a nuanced path for pretraining bilingual LMs in low-resource settings. Overall, the work provides a framework and concrete findings to guide bilingual pretraining under data constraints and motivates further exploration of language-aware data strategies and evaluation methods in multilingual LLMs.

Abstract

Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high-quality pretraining data is unavailable. In this work, we study how to boost pretrained model performance in a target language that lacks sufficient pretraining data for training a high-performing language model, by enlisting data from an auxiliary language for which high-quality data is available. We study this by quantifying the performance gap between training with data in a data-rich auxiliary language and training in the target language, exploring the benefits of translation systems, studying the limitations of model scaling when data is limited in the target language, and proposing new methods for upsampling data from the auxiliary language. Our results show that stronger auxiliary datasets result in performance gains without modification to the model or training objective for close languages, and, in particular, that performance gains due to the development of more information-rich English pretraining datasets can extend to targeted language settings with limited data.

Paper Structure

This paper contains 67 sections, 1 equation, 22 figures, 15 tables.

Figures (22)

  • Figure 1: (a) Data Pipeline: English data pipeline used for building large pretraining corpora in Penedo et al. (2024). (b) Auxiliary Data Pretraining: Combining high-quality domain-specific pretraining data with a small amount of data from the target language for pretraining with limited target data. (c) Data Transforms: Many considerations when building datasets in languages with limited data.
  • Figure 2: Zero-shot accuracy of models trained with higher quality English auxiliary data. Results are averaged over six eval datasets. We compare training with different auxiliary datasets on English and German evaluations. Better English datasets show large increases in English and smaller increases in German.
  • Figure 3: Average zero-shot accuracy in the target language, summarized for eight languages. Models are trained on 100B tokens. We compare four settings: a small amount of monolingual target-language data, a large amount of monolingual target-language data, a small amount of target-language data combined with mC4 English data (same distribution), and a small amount of target-language data combined with FineWebEDU.
  • Figure 4: Average accuracy over zero-shot benchmark tasks in translated Japanese, comparing Chinese and English auxiliary data.
  • Figure 5: Zero-shot accuracy of models trained with model-based filtering of English auxiliary data. Results are averaged over six evaluation datasets. For each setting, evaluation is done in English and German.
  • ...and 17 more figures