Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

Project Apertus, Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, Vinko Sabolčec, Yixuan Xu, Michael Aerni, Badr AlKhamissi, Inés Altemir Mariñas, Mohammad Hossein Amani, Matin Ansaripour, Ilia Badanin, Harold Benoit, Emanuela Boros, Nicholas Browning, Fabian Bösch, Maximilian Böther, Niklas Canova, Camille Challier, Clement Charmillot, Jonathan Coles, Jan Deriu, Arnout Devos, Lukas Drescher, Daniil Dzenhaliou, Maud Ehrmann, Dongyang Fan, Simin Fan, Silin Gao, Miguel Gila, María Grandury, Diba Hashemi, Alexander Hoyle, Jiaming Jiang, Mark Klein, Andrei Kucharavy, Anastasiia Kucherenko, Frederike Lübeck, Roman Machacek, Theofilos Manitaras, Andreas Marfurt, Kyle Matoba, Simon Matrenok, Henrique Mendonça, Fawzi Roberto Mohamed, Syrielle Montariol, Luca Mouchel, Sven Najem-Meyer, Jingwei Ni, Gennaro Oliva, Matteo Pagliardini, Elia Palme, Andrei Panferov, Léo Paoletti, Marco Passerini, Ivan Pavlov, Auguste Poiroux, Kaustubh Ponkshe, Nathan Ranchin, Javi Rando, Mathieu Sauser, Jakhongir Saydaliev, Muhammad Ali Sayfiddinov, Marian Schneider, Stefano Schuppli, Marco Scialanga, Andrei Semenov, Kumar Shridhar, Raghav Singhal, Anna Sotnikova, Alexander Sternfeld, Ayush Kumar Tarun, Paul Teiletche, Jannis Vamvas, Xiaozhe Yao, Hao Zhao, Alexander Ilic, Ana Klimovic, Andreas Krause, Caglar Gulcehre, David Rosenthal, Elliott Ash, Florian Tramèr, Joost VandeVondele, Livio Veraldi, Martin Rajman, Thomas Schulthess, Torsten Hoefler, Antoine Bosselut, Martin Jaggi, Imanol Schlag

TL;DR

Apertus tackles two core issues in the open LLM ecosystem: data compliance and multilingual representation. It delivers a fully open suite of 8B and 70B models trained on 15T tokens from 1811 languages, applying `robots.txt` opt-outs retroactively and using the Goldfish objective to curb verbatim memorization, while approaching state-of-the-art performance among fully open models on multilingual benchmarks. The work emphasizes transparency: model weights, data pipelines, and code are released under a permissive license, enabling audits and extension. Post-training combines SFT with constitutional alignment to the Swiss AI Charter using QRPO. Training ran on the Alps infrastructure at CSCS and scaled to 4096 GPUs. The release is accompanied by careful data filtering, long-context capabilities, and evaluations covering safety, fairness, and low-resource languages, positioning Apertus as a reference point for trustworthy, globally relevant open LLMs with auditable provenance.

Abstract

We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting `robots.txt` exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
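The memorization mitigation mentioned above can be illustrated with a small sketch. It follows the general idea of a Goldfish-style loss: a pseudorandom but deterministic subset of token positions is excluded from the next-token loss, so the model never receives a training signal for reproducing a passage in full. The function names, the hashing scheme, and the values `k=4` and `h=3` are illustrative assumptions, not the settings or code used for Apertus.

```python
import hashlib
from typing import List, Sequence


def goldfish_mask(token_ids: Sequence[int], k: int = 4, h: int = 3) -> List[bool]:
    """Return True for positions that contribute to the loss.

    A position is dropped when a hash of the h preceding token ids falls
    into bucket 0 out of k, so roughly 1/k of positions are dropped.
    Because the decision depends only on local context, a passage that
    appears in several documents gets the same positions dropped each time.
    (k and h are hypothetical values chosen for this sketch.)
    """
    mask = []
    for i in range(len(token_ids)):
        if i < h:
            # Not enough preceding context to hash; keep the token.
            mask.append(True)
            continue
        context = token_ids[i - h:i]
        digest = hashlib.sha256(",".join(map(str, context)).encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % k
        mask.append(bucket != 0)
    return mask


def goldfish_loss(per_token_nll: Sequence[float], token_ids: Sequence[int],
                  k: int = 4, h: int = 3) -> float:
    """Average the per-token negative log-likelihood over kept positions only."""
    mask = goldfish_mask(token_ids, k, h)
    kept = [nll for nll, keep in zip(per_token_nll, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0
```

Deriving the mask from a hash of the local context, rather than from a fresh random draw per batch, is what keeps the dropped positions consistent when the same text is seen more than once.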

Paper Structure

This paper contains 176 sections, 13 equations, 17 figures, 38 tables, 1 algorithm.

Figures (17)

  • Figure 1: Intrinsic Evaluation of Four Multilingual Tokenizers. The Mistral-Nemo tokenizer consistently matches or outperforms other tokenizers in fertility rate, compression ratio, and vocabulary utilization, highlighting its strong overall efficiency. In addition, it achieves a lower Gini coefficient, indicating greater fairness by distributing tokenization costs more evenly across languages.
  • Figure 2: Baseline Comparison with Final Apertus Architecture. We merge all successful and intended changes to architecture and optimizer (xIELU activation, QK-Norm, AdEMAMix, WSD schedule with 1-sqrt annealing, cross-document attention, goldfish loss) into a 3B model, which we train for 100B tokens. Compared to a well-tuned baseline of a standard Llama model with cosine annealing, we achieve notable improvements in stability and gradient norms (right). Simultaneously, the model matches the final training loss of the baseline with 30-40% fewer tokens.
  • Figure 3: Pretraining Loss Curves and Gradient Norms. The entirety of pretraining was stable, without major loss spikes or rollbacks. This held true even with the doubling of the global batch size (GBS), as well as changes in data mixtures, which result in discontinuous loss jumps through the difference in average cross entropy. The different stages of data are described in Section \ref{sec:pretraining_data}; Phase 5 coincides with the learning rate cooldown. For the gradient norms, curves are smoothed with a running window of 500 steps (70B) and 1000 steps (8B). The gradient norms of the 70B model are noticeably smaller. No smoothing is applied to the loss curves.
  • Figure 4: Distributions of Toxicity Scores in 9 Languages, obtained by applying our classifiers to the Chinese, French, German, Italian, Dutch, Polish, Spanish, and Portuguese datasets from FineWeb-2 (penedo2024fineweb-2) and the English dataset from FineWeb (penedo2024finewebdatasetsdecantingweb). The 95% threshold is highlighted as High-Risk.
  • Figure 5: Relationships of our English pretraining datasets, which are all based on CommonCrawl dumps. Not true to scale in terms of token count.
  • ...and 12 more figures
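
The "WSD schedule with 1-sqrt annealing" named in the Figure 2 caption can be sketched as follows. This is a minimal illustration of a warmup-stable-decay learning rate schedule with a 1-sqrt cooldown, assuming a linear warmup and a cooldown that ends at `final_lr`. The function name, its parameters, and the step counts in the usage note are hypothetical and are not taken from the Apertus training configuration.

```python
import math


def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_steps: int, cooldown_steps: int,
           final_lr: float = 0.0) -> float:
    """Warmup-stable-decay schedule with a 1-sqrt cooldown (illustrative)."""
    # Warmup: increase linearly up to the peak learning rate.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Without a cooldown phase the rate simply stays at the peak.
    if cooldown_steps <= 0:
        return peak_lr
    # Stable phase: hold the peak rate until the cooldown begins.
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return peak_lr
    # Cooldown: scale by 1 - sqrt(progress), progress running from 0 to 1.
    progress = min(1.0, (step - cooldown_start) / cooldown_steps)
    return final_lr + (peak_lr - final_lr) * (1.0 - math.sqrt(progress))
```

For example, with `total_steps=1000`, `warmup_steps=100`, and `cooldown_steps=200`, the rate holds at `peak_lr` until step 800 and has dropped to half of it a quarter of the way through the cooldown (step 850), since 1 - sqrt(0.25) = 0.5. Unlike cosine annealing, the constant phase does not depend on the total training length, so the cooldown can be started from any checkpoint.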