EuroLLM: Multilingual Language Models for Europe

Pedro Henrique Martins; Patrick Fernandes; João Alves; Nuno M. Guerreiro; Ricardo Rei; Duarte M. Alves; José Pombal; Amin Farajian; Manuel Faysse; Mateusz Klimaszewski; Pierre Colombo; Barry Haddow; José G. C. de Souza; Alexandra Birch; André F. T. Martins

TL;DR

EuroLLM addresses the English-centric bias of open-weight LLMs by building multilingual, open-weight models that cover all official EU languages plus several additional languages. The authors combine a data collection and filtering pipeline spanning four data categories, a large-vocabulary multilingual tokenizer, and a data mix guided by joint scaling laws to pre-train EuroLLM-1.7B, which is then instruction-tuned on EuroBlocks to produce EuroLLM-1.7B-Instruct. They report robust multilingual performance and competitive machine translation on standard benchmarks, supported by experiments comparing learning-rate schedulers and data-annealing effects. The work lays the groundwork for scalable, Europe-focused multilingual LLMs with practical instruction-following capabilities and opens avenues for further scaling and data-quality improvements.

Abstract

The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, and report their performance on multilingual general benchmarks and machine translation.

Paper Structure (25 sections, 8 figures, 3 tables)

Introduction
Data
Data Collection and Filtering
Web Data
Parallel Data
Code / Math Data
High-quality Data
Annealing Data
Data Mixture
Parallel Data
Joint Scaling Laws
Repeating High Quality Data
Division between Languages
Tokenizer
Modeling
...and 10 more sections

Figures (8)

Figure 1: Percentage attributed to each data category in the first training phase (left) and annealing phase (right).
Figure 2: Joint Scaling laws obtained when varying the percentage of parallel data.
Figure 3: Joint scaling laws obtained when repeating vs. not repeating Wikipedia.
Figure 4: Percentage of the training corpus attributed to each language, excluding English, which accounts for 50% in the first phase and 32.5% during annealing. 5% of the corpus is reserved for code and math datasets in the first phase and 7% during annealing.
Figure 5: Fertility (pieces / word) obtained with the Mistral, LLaMa-3, Gemma, and EuroLLM tokenizers for a subset of the EuroLLM languages.
...and 3 more figures
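
The fertility metric named in the Figure 5 caption (pieces / word) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `fertility` helper, the whitespace definition of a word, the per-word tokenization, and the `chunk_tokenizer` stand-in are all assumptions made for illustration. In practice `tokenize` would be replaced by a real subword tokenizer.

```python
def fertility(tokenize, text):
    """Average number of tokenizer pieces per whitespace-separated word.

    Assumption: words are split on whitespace and tokenized one at a time.
    """
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    pieces = sum(len(tokenize(word)) for word in words)
    return pieces / len(words)


def chunk_tokenizer(word, size=4):
    # Hypothetical stand-in tokenizer: cuts a word into fixed-size character chunks.
    return [word[i:i + size] for i in range(0, len(word), size)]


sample = "multilingual models for Europe"
# Piece counts per word are 3, 2, 1 and 2, so fertility is 8 / 4.
print(fertility(chunk_tokenizer, sample))
```

Lower fertility means fewer pieces are needed per word, so the same text takes up fewer tokens.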
