PoETa v2: Toward More Robust Evaluation of Large Language Models in Portuguese

Thales Sales Almeida, Ramon Pires, Hugo Abonizio, Rodrigo Nogueira, Hélio Pedrini

TL;DR

PoETa v2 presents the most extensive evaluation of Portuguese LLMs to date, combining 44 tasks (12 native, 32 translated) and more than 20 models to quantify how compute and language-specific pretraining affect performance. The benchmark uses a FLOPs-based Computational Cost metric and the Normalized Preferred Metric (NPM) to enable fair cross-task comparisons. The results show that larger, Portuguese-adapted models generally perform better, but that a persistent English-Portuguese performance gap remains, especially for smaller models. The work highlights the value of native Portuguese tasks, transparency in pretraining data, and bias/robustness analyses, and it establishes PoETa v2 as an open foundation for ongoing, regionally grounded NLP research. Overall, PoETa v2 provides both a practical evaluation suite and insight into how linguistic adaptation and resource investment shape LLM capabilities in Portuguese.

Abstract

Large Language Models (LLMs) exhibit significant variations in performance across linguistic and cultural contexts, underscoring the need for systematic evaluation in diverse languages. In this work, we present the most extensive evaluation of LLMs for the Portuguese language to date. Leveraging our newly introduced PoETa v2 benchmark -- a comprehensive suite of over 40 tasks in Portuguese -- we assess more than 20 models covering a broad spectrum of training scales and computational resources. Our study reveals how computational investment and language-specific adaptation impact performance in Portuguese, while also analyzing performance gaps in comparison to equivalent tasks in English. Through this benchmark and analysis, PoETa v2 lays the groundwork for future research on Portuguese language modeling and evaluation. The benchmark is available at https://github.com/PoETaV2/PoETaV2.
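
The two metrics named in the TL;DR are not defined in this section, so the following is only a minimal sketch under stated assumptions. It assumes NPM follows the usual normalization (a task's preferred metric is rescaled so that a random baseline maps to 0 and the maximum score to 100, then averaged over tasks) and that computational cost can be approximated with the common 6 × parameters × tokens rule of thumb for training FLOPs. The function names and the example numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean


def normalized_preferred_metric(raw, random_score, max_score):
    """Rescale one task score: random baseline -> 0, maximum score -> 100.

    Assumed definition; the paper's exact formula is not shown in this section.
    """
    if max_score <= random_score:
        raise ValueError("max_score must be greater than random_score")
    return 100.0 * (raw - random_score) / (max_score - random_score)


def average_npm(results):
    """Average NPM over tasks.

    `results` is an iterable of (raw, random_score, max_score) tuples,
    one per task, all on the same scale within each tuple.
    """
    return mean(
        normalized_preferred_metric(raw, rnd, mx) for raw, rnd, mx in results
    )


def approx_training_flops(n_params, n_tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token.

    This is an assumption standing in for the paper's Computational Cost metric.
    """
    return 6.0 * n_params * n_tokens


if __name__ == "__main__":
    # Hypothetical tasks: a 4-way multiple-choice task (chance = 0.25)
    # and a binary task where the model scores exactly at chance (0.5).
    tasks = [(0.625, 0.25, 1.0), (0.5, 0.5, 1.0)]
    print(average_npm(tasks))
    # Hypothetical model: 1B parameters trained on 1T tokens.
    print(approx_training_flops(1e9, 1e12))
```

Normalizing against the random baseline is what makes scores comparable across tasks with different chance levels: 62.5% accuracy on a four-way multiple-choice task and 75% on a binary task both map to an NPM of 50.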

Paper Structure

This paper contains 23 sections, 3 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overview of the distribution of task types and categories in PoETa v2. Each task is assigned a primary type and may belong to multiple subcategories.
  • Figure 2: Computational cost versus average NPM score for the evaluated models on PoETa v2. Colors distinguish different model families.
  • Figure 3: Average NPM across task subcategories plotted against computational cost.
  • Figure 4: Heatmap of NPM for Tucano 2.4B and TinyLlama 1T on 20 PoETa v2 tasks. The last row shows the average performance across the two models; the last column shows each model's average performance over all 20 tasks.
  • Figure 5: Scaling trends per task type in PoETa v2.
  • ...and 1 more figure