EuroLLM: Multilingual Language Models for Europe

Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins

TL;DR

EuroLLM addresses the English-centric bias of open-weight LLMs by building open-weight multilingual models that cover all official EU languages plus several additional languages. The authors combine a data collection and filtering pipeline spanning four data categories, a large-vocabulary multilingual tokenizer, and a data mix guided by joint scaling laws, culminating in the pre-trained EuroLLM-1.7B and the instruction-tuned EuroLLM-1.7B-Instruct, obtained via EuroBlocks. They report robust multilingual performance and competitive machine translation on standard benchmarks, supported by experiments comparing learning-rate schedulers and data-annealing effects. The work lays the groundwork for scalable, Europe-focused multilingual LLMs with practical instruction-following capabilities and opens avenues for further scaling and data-quality improvements.

Abstract

The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, and report their performance on multilingual general benchmarks and machine translation.

Paper Structure (25 sections, 8 figures, 3 tables)

This paper contains 25 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Percentage attributed to each data category in the first training phase (left) and annealing phase (right).
  • Figure 2: Joint scaling laws obtained when varying the percentage of parallel data.
  • Figure 3: Joint scaling laws obtained when repeating vs. not repeating Wikipedia.
  • Figure 4: Percentage of the training corpus attributed to each language, excluding English, which accounts for 50% in the first phase and 32.5% during annealing. Code and math datasets make up 5% of the corpus in the first phase and 7% during annealing.
  • Figure 5: Fertility (pieces / word) obtained with the Mistral, LLaMa-3, Gemma, and EuroLLM tokenizers for a subset of the EuroLLM languages.
  • ...and 3 more figures
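
The fertility metric named in Figure 5 (pieces per word) can be illustrated with a minimal sketch. It assumes that words are obtained by whitespace splitting, which may differ from the paper's setup, and it uses a hypothetical toy tokenizer (`toy_tokenize`) as a stand-in for a real subword tokenizer such as those compared in the figure. Lower fertility means a tokenizer needs fewer pieces to encode the same text.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokenizer pieces per whitespace-separated word."""
    total_pieces = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())      # assumption: words = whitespace tokens
        total_pieces += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in input texts")
    return total_pieces / total_words


def toy_tokenize(text: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into chunks of at most max_len characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + max_len] for i in range(0, len(word), max_len))
    return pieces


if __name__ == "__main__":
    sample = ["multilingual tokenizer"]
    print(fertility(sample, toy_tokenize))
```

To compare real tokenizers, `toy_tokenize` would be swapped for each tokenizer's own text-to-pieces function and the same sample texts reused for every language of interest.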