EuroLLM: Multilingual Language Models for Europe
Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins
TL;DR
EuroLLM addresses the English-centric bias of open-weight LLMs by building open-weight multilingual models that cover all official EU languages plus several additional languages. The authors combine a data collection and filtering pipeline spanning four data categories, a large-vocabulary multilingual tokenizer, and a data mix guided by joint scaling laws. With these they pre-train EuroLLM-1.7B and instruction-tune it via EuroBlocks to obtain EuroLLM-1.7B-Instruct. The models show robust multilingual performance and competitive machine translation on standard benchmarks, supported by experiments comparing learning-rate schedulers and data-annealing effects. The work lays the groundwork for scalable, Europe-focused multilingual LLMs with practical instruction-following capabilities and opens avenues for further scaling and data-quality improvements.
Abstract
The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, and report their performance on multilingual general benchmarks and machine translation.
