Sabiá-3 Technical Report

Hugo Abonizio, Thales Sales Almeida, Thiago Laitz, Roseval Malaquias Junior, Giovana Kerche Bonás, Rodrigo Nogueira, Ramon Pires

TL;DR

This paper introduces Sabiá-3 and Sabiazinho-3, Brazil-focused language models trained on a large Brazilian Portuguese corpus to leverage domain-specific linguistic and cultural knowledge. The authors use a two-phase training regime: pre-training on specialized data with next-token prediction, followed by instruction tuning and preference alignment. Training runs on TPU v5 with JAX, using data and model parallelism. Sabiá-3 is competitive with frontier LLMs on knowledge-intensive tasks at a substantially lower cost per token and outperforms its predecessor in reasoning and long-context processing; however, it still trails top-tier models on multi-step tasks and some instruction-following benchmarks. The findings highlight the practical value of domain specialization for cost-effective performance in Brazil-centric applications and point to future work on multi-turn instruction following and agentic capabilities.

Abstract

This report presents Sabiá-3, our new flagship language model, and Sabiazinho-3, a more cost-effective sibling. The models were trained on a large Brazil-centric corpus. Evaluations across diverse professional and academic benchmarks show strong performance on Portuguese and Brazil-related tasks. Sabiá-3 shows large improvements over our previous best model, Sabiá-2 Medium, especially in reasoning-intensive tasks. Notably, Sabiá-3's average performance matches frontier LLMs, while it is offered at a three to four times lower cost per token, reinforcing the benefits of domain specialization.

Paper Structure

This paper contains 9 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Price in USD per million tokens, considering an equal proportion of input and output tokens, versus performance on 70 Brazilian exams (in Portuguese). The dashed curve represents a Pareto frontier for general-purpose LLMs such as Llama 3.1 and GPT-4o, which Sabiá-3 and Sabiazinho-3 surpass due to their domain specialization.
  • Figure 2: Accuracies of Sabiá-3, Sabiá-2 Medium and GPT-4o on Enade 2022 and 2023 exams, ordered from low to high based on Sabiá-2 Medium performance. Sabiá-3 outperforms Sabiá-2 Medium on 76% of the exams.
  • Figure 3: Win, tie and loss rates for Sabiá-3 against other LLMs on the BRACEval conversation benchmark according to GPT-4-turbo as a judge.
  • Figure 4: Category-wise adjusted win rates of Sabiá-3 against other models on BRACEval. A value of 0.5 represents a tie; competitor models scoring below 0.5 are worse than Sabiá-3, and those scoring above 0.5 are better.
  • Figure 5: Performance of the Sabiá-3 model in the Portuguese-adapted Needle-in-the-Haystack (NIAH) benchmark.
  • ...and 1 more figure
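
The Figure 1 caption describes two computations: a blended price per million tokens that weights input and output tokens equally, and a Pareto frontier over price and accuracy. The sketch below is one possible way to compute both. The model names, prices, and accuracies are hypothetical placeholders, not values from the report. The dominance rule used here (a model is kept only if no model at the same or lower price is at least as accurate) is an assumption about how such a frontier is usually drawn, not the authors' stated procedure.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    input_price: float   # USD per million input tokens (hypothetical value)
    output_price: float  # USD per million output tokens (hypothetical value)
    accuracy: float      # benchmark accuracy in [0, 1] (hypothetical value)


def blended_price(m: Model) -> float:
    """Price per million tokens with equal input and output proportions."""
    return 0.5 * m.input_price + 0.5 * m.output_price


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Return the models not dominated on (lower price, higher accuracy)."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on equal price, the more
    # accurate model comes first. A model joins the frontier only if it
    # is strictly more accurate than every cheaper-or-equal model so far.
    for m in sorted(models, key=lambda m: (blended_price(m), -m.accuracy)):
        if m.accuracy > best_accuracy:
            frontier.append(m)
            best_accuracy = m.accuracy
    return frontier


# Hypothetical data for illustration only.
models = [
    Model("model-a", 1.0, 3.0, 0.60),
    Model("model-b", 2.0, 6.0, 0.55),   # pricier and less accurate than model-a
    Model("model-c", 5.0, 15.0, 0.80),
]

print([m.name for m in pareto_frontier(models)])  # ['model-a', 'model-c']
```

A model that lies above such a frontier, as the caption reports for Sabiá-3 and Sabiazinho-3 relative to general-purpose LLMs, would be appended by this procedure if it were added to the candidate list.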