Multilingual Language Model Pretraining using Machine-translated Data

Jiayi Wang, Yao Lu, Maurice Weber, Max Ryabinin, David Adelani, Yihong Chen, Raphael Tang, Pontus Stenetorp

TL;DR

This work tackles multilingual LLM data scarcity by translating a high-quality English pretraining corpus (FineWeb-Edu) into nine languages to form TransWebEdu (~1.7T tokens). A 1.3B-parameter model, TransWebLLM, is pretrained from scratch on this data and evaluated across nine non-English reasoning benchmarks, achieving state-of-the-art or competitive results with far fewer tokens than closed-data rivals. The authors further show that incorporating <5% TransWebEdu as domain-specific data yields new SOTA results in several languages, and that adding limited general web data, rephrased synthetic data, and cooldown data can enhance performance further (notably for Swahili, Indonesian, and French). By releasing the dataset, models, and training pipeline under open licenses, this work provides a scalable, reproducible approach to multilingual pretraining that improves coverage for medium- and low-resource languages and broadens the applicability of multilingual NLP.

Abstract

High-resource languages such as English enable the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages: LLMs still underperform on non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated texts from a single high-quality source language can contribute significantly to the pretraining quality of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into nine languages, resulting in a 1.7-trillion-token dataset, which we call TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on this dataset. Across nine non-English reasoning tasks, we show that TransWebLLM matches or outperforms state-of-the-art multilingual models trained using closed data, such as Llama3.2, Qwen2.5, and Gemma, despite using an order of magnitude less data. We demonstrate that adding less than 5% of TransWebEdu as domain-specific pretraining data sets a new state of the art on Arabic, Italian, Indonesian, Swahili, and Welsh understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus, models, and training pipeline under Open Source Initiative-approved licenses.

Paper Structure

This paper contains 34 sections, 3 figures, 26 tables.

Figures (3)

  • Figure 1: Step-by-step illustration of the translation pipeline to obtain TransWebEdu.
  • Figure 2: Step-by-step illustration of the translation pipeline with the Mistral-7B-Instruct model.
  • Figure 3: Chat template used for prompting Mistral-7B-Instruct-v0.1 for English-French translation.