Tucano 2 Cool: Better Open Source LLMs for Portuguese

Nicholas Kluge Corrêa, Aniket Sen, Shiza Fatimah, Sophia Falk, Lennard Landgraf, Julia Kastner, Lucie Flek

TL;DR

This work designs both pretraining and continual pretraining recipes for the Tucano 2 suite (Base, Instruct, and Think), which achieve state-of-the-art performance on several Portuguese-language modeling benchmarks. It also extends and refines the evaluation harness, yielding a comprehensive evaluation suite that provides strong signals across different pretraining, continual pretraining, and post-training regimes.

Abstract

We present Tucano 2, a fully open suite of large language models (LLMs) with 0.5 to 3.7 billion parameters, designed to address certain gaps in open-source development for Portuguese LLMs. Building on our previous work, we extend our dataset, GigaVerbo-v2, to a new degree of quality and scale. We also introduce a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling gaps in GigaVerbo-v2, and two post-training datasets, GigaVerbo-v2 SFT and GigaVerbo-v2 Preferences, that allow Portuguese LLMs to be trained in domains such as retrieval-augmented generation, coding, tool use, and chain-of-thought reasoning, among others. Through extensive ablation studies, we design both pretraining and continual pretraining recipes for the Tucano 2 suite (Base, Instruct, and Think), which achieve state-of-the-art performance on several Portuguese-language modeling benchmarks. We also extend and refine the evaluation harness introduced in our earlier work, yielding a comprehensive evaluation suite that provides strong signals across different pretraining, continual pretraining, and post-training regimes. All artifacts associated with Tucano 2 are openly released, including training recipes, logs, and source code, ensuring that our work is reproducible, accessible, and extendable by the broader Portuguese NLP community.

Paper Structure

This paper contains 215 sections, 4 equations, 36 figures, and 65 tables.

Figures (36)

  • Figure 1: Impact of Educational & Synthetic Data (46B tokens). The Edu+Synth and Edu mixtures achieve the best performance across benchmarks, substantially outperforming the Non-Edu mixture and the Tucano-2b4 baseline. The percentage values represent the relative increase or decrease in performance relative to the Tucano-2b4 baseline.
  • Figure 2: Pretraining loss curve across 195,000 steps ($\sim$408B tokens).
  • Figure 3: Normalized Preferred Metric (NPM) scores on the Easy Set evaluations.
  • Figure 4: Comparison of our 0.6-Base model against Tucano-1b1 and Curió-edu-1b1.
  • Figure 5: Per-benchmark performance comparison. Bars indicate the absolute difference in evaluation scores between the continually pretrained model and its base.
  • ...and 31 more figures
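
The captions above rely on three simple quantities: a relative change against a baseline (Figure 1), an absolute difference between a continually pretrained model and its base (Figure 5), and a token count spread over training steps (Figure 2). The sketch below shows one plausible way to compute them. The function names are hypothetical and not taken from the paper's code; the example scores are placeholders, not reported results; only the step and token counts come from the Figure 2 caption.

```python
def relative_change_pct(score: float, baseline: float) -> float:
    """Relative change of `score` versus `baseline`, in percent.

    Assumed reading of the Figure 1 percentages; the paper may define
    the quantity differently.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (score - baseline) / abs(baseline)


def absolute_difference(score: float, base_score: float) -> float:
    """Absolute score difference, as described for the Figure 5 bars."""
    return score - base_score


def tokens_per_step(total_tokens: float, steps: int) -> float:
    """Average number of tokens consumed per optimizer step."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    return total_tokens / steps


# Placeholder scores for illustration only (not from the paper).
example_gain = relative_change_pct(score=55.0, baseline=50.0)
example_delta = absolute_difference(score=55.0, base_score=50.0)

# Values stated in the Figure 2 caption: ~408B tokens over 195,000 steps.
approx_tokens_per_step = tokens_per_step(408e9, 195_000)
```

With the Figure 2 values, this back-of-the-envelope estimate comes to roughly 2.09 million tokens per step. It is an average implied by the caption, not a batch size reported by the authors.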