Tiny Aya: Bridging Scale and Multilingual Depth

Alejandro R. Salamanca, Diana Abagyan, Daniel D'souza, Ammar Khairi, David Mora, Saurabh Dash, Viraat Aryabumi, Sara Rajaee, Mehrnaz Mofakhami, Ananya Sahu, Thomas Euyang, Brittawnya Prince, Madeline Smith, Hangyu Lin, Acyr Locatelli, Sara Hooker, Tom Kocmi, Aidan Gomez, Ivan Zhang, Phil Blunsom, Nick Frosst, Joelle Pineau, Beyza Ermis, Ahmet Üstün, Julia Kreutzer, Marzieh Fadaee

Abstract

Tiny Aya redefines what a small multilingual language model can achieve. Trained on 70 languages and refined through region-aware posttraining, it delivers state-of-the-art translation quality, strong multilingual understanding, and high-quality target-language generation, all with just 3.35B parameters. The release includes a pretrained foundation model, a globally balanced instruction-tuned variant, and three region-specialized models targeting languages from Africa, South Asia, Europe, Asia-Pacific, and West Asia. This report details the training strategy, data composition, and comprehensive evaluation framework behind Tiny Aya, and presents an alternative scaling path for multilingual AI: one centered on efficiency, balanced performance across languages, and practical deployment.

Paper Structure (43 sections, 1 equation, 15 figures, 26 tables)

Figures (15)

  • Figure 1: Benchmark performance across regions. The Tiny Aya model family performs competitively across languages, regions and multilingual benchmark tasks. Comparing the Tiny Aya model that scores best for each region with similar-sized competitors aggregated across multiple massively multilingual benchmarks for a diverse set of tasks (mDolly, mArenaHard, GlobalMGSM, Flores, GlobalMMLU), we find that Tiny Aya advances the state of the art for languages from West Asia and Africa.
  • Figure 2: Posttraining pipeline and model construction. Starting from Tiny Aya Base, we run region-specific supervised finetuning on five regional data subsets and tune the final regional mixtures. In parallel, we train a global supervised fine-tuned model over all regions with minimal alignment. Each region model is then merged with the global model to produce the final region-specialized releases.
  • Figure 3: Tokenization efficiency across scripts. We report the average tokens per character for each script, comparing the Tiny Aya tokenizer with Gemma3-4B, Qwen3-4B, and SmolLM3-3B tokenizers. Scripts are sorted by total stacked height from smallest to largest. The label beneath each script reports the number of languages using that script in Tiny Aya. Lower values indicate better tokenization efficiency. Our tokenizer (green) achieves competitive or superior compression across most scripts, with particularly strong performance on scripts underserved by other models such as Khmer, Telugu, Gujarati, Lao, and Ge'ez.
  • Figure 4: Regional composition of posttraining data clusters. Share of posttraining data drawn from each region for each cluster mixture used to train region-specific SFT models. These SFT models are later used for merging as shown in Figure 2. English and code are present in all clusters, and the remaining proportions reflect region- and language-level dataset availability.
  • Figure 5: Open-ended generation quality versus web presence. mDolly judge scores plotted against an approximate web-presence proxy based on Common Crawl, bucketed into five equal-width bins. The trend highlights robustness in lower-web-presence languages relative to same-scale competitors.
  • ...and 10 more figures
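
As a rough illustration of the tokens-per-character metric reported in Figure 3, the sketch below computes a corpus-level ratio of total tokens to total characters. It is a minimal sketch under stated assumptions, not the paper's evaluation code: the caption does not say whether the average is taken over the corpus or per text, `tokens_per_character` is a hypothetical helper, and the two toy tokenizers stand in for the real subword tokenizers being compared.

```python
from typing import Callable, Iterable, List


def tokens_per_character(
    texts: Iterable[str], tokenize: Callable[[str], List[str]]
) -> float:
    """Corpus-level ratio: total tokens divided by total characters.

    Assumption: characters are counted as Unicode code points via len(),
    not as grapheme clusters. Lower values mean better compression.
    """
    total_tokens = 0
    total_chars = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_chars += len(text)
    if total_chars == 0:
        raise ValueError("no characters to measure")
    return total_tokens / total_chars


# Hypothetical stand-in tokenizers, used only to make the sketch runnable.
def char_tokenizer(text: str) -> List[str]:
    # Worst case: one token per character.
    return list(text)


def whitespace_tokenizer(text: str) -> List[str]:
    # One token per whitespace-separated word.
    return text.split()


samples = ["ab cd", "efg"]  # 8 characters in total
char_ratio = tokens_per_character(samples, char_tokenizer)
word_ratio = tokens_per_character(samples, whitespace_tokenizer)
```

To compare tokenizers per script in the manner of Figure 3, one would group the sample texts by script and pass each tokenizer's encode function as `tokenize`.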