EuroLLM-22B: Technical Report

Miguel Moura Ramos; Duarte M. Alves; Hippolyte Gisserot-Boukhlef; João Alves; Pedro Henrique Martins; Patrick Fernandes; José Pombal; Nuno M. Guerreiro; Ricardo Rei; Nicolas Boizard; Amin Farajian; Mateusz Klimaszewski; José G. C. de Souza; Barry Haddow; François Yvon; Pierre Colombo; Alexandra Birch; André F. T. Martins

TL;DR

EuroLLM-22B targets European language equity with an open, multilingual LLM that natively supports $24$ EU languages plus $11$ additional languages. It combines a Megatron-LM-based architecture using RoPE, RMSNorm, and SwiGLU with a carefully filtered, multi-source dataset, extends the context length to $32{,}768$ tokens via a three-phase curriculum, and finishes with a high-quality post-training regime using EuroBlocks-SFT-2512. The resulting model is competitive with open models of similar size, with notable gains in instruction following and multilingual reasoning while maintaining translation quality. It is released alongside data, base and instruct models, and evaluation tools to support reproducibility and EU AI research. The work shows that high-quality multilingual data, targeted post-training, and long-context modeling can yield robust European-language capabilities at a relatively modest pre-training token budget, providing a practical foundation for European AI development and deployment.
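For readers unfamiliar with two of the components named above, the following is a minimal pure-Python sketch of RMSNorm and a SwiGLU feed-forward block as they are commonly defined. It is illustrative only and is not taken from the EuroLLM codebase: the function names, the `eps` value, and the list-based weight layout are assumptions made for this example.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then apply a per-dimension gain.

    eps is a small stabilizer; its value here is an assumption.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

def silu(z):
    """SiLU (Swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

if __name__ == "__main__":
    # Hypothetical toy input and weights, chosen only to exercise the code.
    x = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(swiglu_ffn(x, identity, identity, [[1.0, 1.0]]))
```

Compared with LayerNorm, RMSNorm skips mean subtraction and the bias term, and SwiGLU replaces the single up-projection of a standard feed-forward layer with a gated pair of projections.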

Abstract

This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data and updated EuroBlocks instruction datasets, as well as our pre-training and evaluation codebases.

Paper Structure (50 sections, 2 figures, 29 tables)

Introduction
Pre-training
Modeling
Training Phases
Dataset
English Web Data.
Multilingual Web Data.
Parallel Data.
Code / Math Data.
Synthetic Math Data.
Higher-quality Data.
Long-context data.
Post Training
Data
Supervised fine-tuning
...and 35 more sections

Figures (2)

Figure 1: Scheme of the learning rate scheduler.
Figure 2: Language-wise percentage of the post-training corpus, excluding code/math/STEM data. English comprises 60% of the total data, multilingual content 20%, and code/math/STEM data 20%.
