Salamandra Technical Report

Aitor Gonzalez-Agirre, Marc Pàmies, Joan Llop, Irene Baucells, Severino Da Dalt, Daniel Tamayo, José Javier Saiz, Ferran Espuña, Jaume Prats, Javier Aula-Blasco, Mario Mina, Iñigo Pikabea, Adrián Rubio, Alexander Shvets, Anna Sallés, Iñaki Lacunza, Jorge Palomar, Júlia Falcão, Lucía Tormo, Luis Vasquez-Reina, Montserrat Marimon, Oriol Pareras, Valle Ruiz-Fernández, Marta Villegas

TL;DR

Salamandra is a suite of open-source decoder-only LLMs (2B, 7B, 40B) trained from scratch on 35 European languages plus code, with public instruction-tuned variants and preliminary multimodal capabilities. The authors detail the architecture (RoPE, SwiGLU, RMSNorm, FlashAttention), a large 256k-vocabulary tokenizer, and a balanced multilingual pretraining corpus assembled from curated and web sources and processed with the Ungoliant and CURATE pipelines. Post-training includes instruction tuning and vision-language experiments, with comprehensive multilingual evaluation via IberoBench and LM Evaluation Harness, augmented by LLM-as-a-Judge prompts. Safety, bias, and ethics are examined through BBQ/EsBBQ benchmarks, regard analysis, cognitive bias assessment, and multilingual red-teaming using Llama Guard 3, highlighting both progress and remaining gaps in multilingual safety. Overall, Salamandra advances open multilingual LLM research and provides a framework for future improvements in alignment, safety, and multimodal capabilities for European languages.

Abstract

This work introduces Salamandra, a suite of open-source decoder-only large language models available in three different sizes: 2, 7, and 40 billion parameters. The models were trained from scratch on highly multilingual data that comprises text in 35 European languages and code. Our carefully curated corpus is made exclusively from open-access data compiled from a wide variety of sources. Along with the base models, supplementary checkpoints that were fine-tuned on public-domain instruction data are also released for chat applications. We also share our preliminary experiments on multimodality, which serve as a proof of concept to showcase potential applications for the Salamandra family. Our extensive evaluations on multilingual benchmarks reveal that Salamandra has strong capabilities, achieving competitive performance when compared to similarly sized open-source models. We provide comprehensive evaluation results both on standard downstream tasks and on key aspects related to bias and safety. With this technical report, we intend to promote open science by sharing all the details behind our design choices, data curation strategy and evaluation methodology. In addition, we deviate from the usual practice by making our training and evaluation scripts publicly accessible. We release all models under a permissive Apache 2.0 license in order to foster future research and facilitate commercial use, thereby contributing to the open-source ecosystem of large language models.

Paper Structure

This paper contains 98 sections, 7 equations, 22 figures, 27 tables, 1 algorithm.

Figures (22)

  • Figure 1: Comparison of tokenizer fertility (i.e. tokens-per-word) across multiple languages: Catalan, Greek, English, Spanish, Basque, Finnish, Irish, Galician, Lithuanian and Russian. The horizontal lines show the fertility of a monolingual tokenizer with a vocabulary size of 50k tokens.
  • Figure 2: Fertility score of a tokenizer trained on a balanced dataset where each language is represented equally (i.e. Uniform distribution), compared to a tokenizer that has been trained on a random subsample of data from the training corpora (i.e. Non-uniform distribution). The horizontal lines show the fertility of a monolingual tokenizer with 50k tokens of vocabulary.
  • Figure 3: Distribution of sources in the Salamandra pre-training dataset. Each data point represents a source, with colours indicating the type and circle size indicating the relative number of words. The logarithmic scale is used to capture variability in dataset size, which spans several orders of magnitude, so that smaller significant sources remain visible alongside larger datasets. Sources with less than 1% of the words are listed in the lower right text box for completeness.
  • Figure 4: Distribution of tokens in the pre-training and continued training phase corpus after applying epoch sampling. The languages are grouped under families, represented with the ISO 639-1 codes.
  • Figure 5: Overview of data distribution in visual instruction tuning phases. In total, the dataset contains 6.1 million instances, of which 842,000 are text-only. (Left) Language distribution in the text-only dataset. (Center) Distribution of multimodal versus text-only data. (Right) Distribution of task types across the multimodal dataset.
  • ...and 17 more figures
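
The captions of Figures 1 and 2 define tokenizer fertility as tokens-per-word. A minimal sketch of that metric follows, assuming words are delimited by whitespace. The `toy_chunk_tokenizer` helper is a hypothetical stand-in used only for illustration; it is not the Salamandra tokenizer, and the report may count words differently.

```python
from typing import Callable, List


def fertility(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_words = sum(len(text.split()) for text in texts)
    n_tokens = sum(len(tokenize(text)) for text in texts)
    if n_words == 0:
        raise ValueError("input contains no words")
    return n_tokens / n_words


def toy_chunk_tokenizer(text: str) -> List[str]:
    # Hypothetical tokenizer (assumption, for illustration only):
    # splits every word into chunks of at most 4 characters.
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces


if __name__ == "__main__":
    sample = ["hello world"]
    # "hello" -> ["hell", "o"], "world" -> ["worl", "d"]: 4 tokens for 2 words.
    print(fertility(sample, toy_chunk_tokenizer))
```

A lower fertility means a language's words are split into fewer tokens, which is why the figures compare each multilingual tokenizer against a monolingual 50k-vocabulary baseline.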