Tucano: Advancing Neural Text Generation for Portuguese

Nicholas Kluge Corrêa; Aniket Sen; Sophia Falk; Shiza Fatimah

TL;DR

The paper tackles data scarcity and fragmented evaluation in Portuguese NLP by building GigaVerbo, a large, deduplicated Portuguese corpus, and training Tucano, a family of open-source, decoder-only, Llama-based models. It emphasizes open science by releasing datasets, code, and logs, and it critically analyzes how benchmark performance correlates with token ingestion, showing strong gains on several Portuguese-native evaluations while highlighting overfitting risks and evaluation limitations. The models perform competitively against multilingual and native baselines of similar size, and the work also reports energy usage and argues for sustainable, reproducible research practices. Overall, Tucano provides a scalable, transparent framework for advancing Portuguese language modeling and can serve as a blueprint for similar efforts in other low-resource languages.

Abstract

Significant advances have been made in natural language processing in recent years. However, our current deep learning approach to language modeling requires substantial resources in terms of data and computation. One of the side effects of this data-hungry paradigm is the current schism between languages, separating those considered high-resource, where most of the development happens and resources are available, from the low-resource ones, which struggle to attain the same level of performance and autonomy. This study aims to introduce a new set of resources to stimulate the future development of neural text generation in Portuguese. In this work, we document the development of GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens. Using this corpus, we trained a series of decoder-transformers named Tucano. Our models perform on par with or better than other Portuguese and multilingual language models of similar size on several Portuguese benchmarks. The evaluation of our models also reveals that model performance on many currently available benchmarks used by the Portuguese NLP community has little to no correlation with the scaling of token ingestion during training, highlighting the limitations of such evaluations when it comes to the assessment of Portuguese generative language models. All derivatives of our study are openly released on GitHub and Hugging Face. See https://nkluge-correa.github.io/Tucano/
