Tucano: Advancing Neural Text Generation for Portuguese

Nicholas Kluge Corrêa, Aniket Sen, Sophia Falk, Shiza Fatimah

TL;DR

The paper tackles the data scarcity and evaluation fragmentation in Portuguese NLP by building GigaVerbo, a large, deduplicated Portuguese corpus, and training Tucano, a family of open-source decoder-only Llama-based models. It emphasizes open science by releasing datasets, code, and logs, and it critically analyzes how benchmark performance correlates with token ingestion, showing strong gains on several Portuguese-native evaluations while highlighting overfitting risks and evaluation limitations. The work demonstrates competitive performance against multilingual and native baselines for models of similar size, while also illuminating energy usage and the need for sustainable, reproducible research practices. Overall, Tucano provides a scalable, transparent framework to advance Portuguese language modeling and serves as a blueprint for similar efforts in other low-resource languages.

Abstract

Significant advances have been made in natural language processing in recent years. However, our current deep learning approach to language modeling requires substantial resources in terms of data and computation. One of the side effects of this data-hungry paradigm is the current schism between languages, separating those considered high-resource, where most of the development happens and resources are available, and the low-resource ones, which struggle to attain the same level of performance and autonomy. This study aims to introduce a new set of resources to stimulate the future development of neural text generation in Portuguese. In this work, we document the development of GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens. Via this corpus, we trained a series of decoder-transformers named Tucano. Our models perform equal or superior to other Portuguese and multilingual language models of similar size in several Portuguese benchmarks. The evaluation of our models also reveals that model performance on many currently available benchmarks used by the Portuguese NLP community has little to no correlation with the scaling of token ingestion during training, highlighting the limitations of such evaluations when it comes to the assessment of Portuguese generative language models. All derivatives of our study are openly released on GitHub and Hugging Face. See https://nkluge-correa.github.io/Tucano/

Paper Structure

This paper contains 19 sections, 7 figures, 13 tables.

Figures (7)

  • Figure 1: This timeline illustrates several Portuguese language model releases from 2020 to October 2024. The models are color-coded to indicate their respective Portuguese language variants, e.g., green for South America and blue for Europe. The timeline also distinguishes pre-trained models from fine-tuned derivatives of other foundations. We limited the models displayed in this timeline to those we could find tied to publication reports, unpublished manuscripts, peer-reviewed papers, and popular repositories.
  • Figure 2: This graph shows the distribution of scores for four subsets of GigaVerbo. We labeled a text as "high" quality if its GPT-4o score was >= 0.8 and "low" if it was <= 0.6, which kept a more balanced proportion of labels for our classifiers. Datasets like monoHPLT and Corpus Carolina have some of the lowest-quality samples. Also, because GPT-4o is extremely sensitive to toxic and harmful content, samples containing toxic, dangerous, or NSFW content end up being scored very low (< 0.1), which gives us a way to account for the toxicity in our dataset. Analyzing samples from the Wikipedia portion scored by GPT-4o, we found that the model consistently gives low scores (< 0.5) to ill-formatted, incomplete, or excessively short documents (< 20 words). This classification/regression dataset is available at https://huggingface.co/datasets/TucanoBR/GigaVerbo-Text-Filter.
  • Figure 3: The figure above lets us understand specific relationships between vocabulary size and the respective tokenizer's capabilities regarding compression. For example, models that use the Llama 2 tokenizer (e.g., Sabiá), primarily focused on English, do not encode Portuguese very efficiently. On a similar note, Sabiá-2 has the worst performance across all tokenizers, even though it has double the vocab size of its predecessor. Meanwhile, multilingual models, like mBERT, PolyLM, Llama 3, mT5, and mGPT, improve their compression efficiency by having significantly enlarged vocabularies, with Bloom, XGLM, and XLM being close to the top of this comparison, all using massive multilingual vocabularies with > 250,000 tokens. As a middle ground between efficiency and resource consumption (i.e., larger vocabularies imply larger embedding matrices, which then imply more computational requirements for inference or training), we have tokenizers with vocabularies tailored for the Portuguese domain (e.g., BERTabaporu, TeenyTinyLlama, BERTimbau). In summary, while multilingual (or larger) vocabularies generally offer improved compression, small, domain-specific tokenizers balance efficiency and computational resource consumption. The code for replicating this test is available in https://github.com/Nkluge-correa/Tucano/tree/main/logs/README.md.
  • Figure 4: We tested several batch sizes on our 160-million-parameter model, from 1 million tokens (batch size 512) to 65 thousand tokens (batch size 32), while also trying to reproduce a 0.5-million-token batch (batch size 256) via different levels of gradient accumulation (i.e., 2, 4, and 8 accumulation steps). We kept the learning rate and the AdamW $\beta$ values constant for all these tests, together with a linear warm-up of 1,000 steps. As expected, the 1-million-token batch, with no gradient accumulation (i.e., an accumulation step of 1), produced the best loss curve, with faster convergence at the earliest stages of training, followed by all other batch sizes (i.e., 256, 128, 64, and 32) that did not use gradient accumulation. At the same time, the more gradient accumulation steps are applied to achieve a desired batch size, the slower the convergence rate, up to the point that training with a global batch size of 32 and a global batch of 512 achieved via 2 gradient accumulation steps yield the same results in terms of convergence speed. The plot on the left shows the shape of the loss curve for several different batch sizes and gradient accumulation configurations, and the plot on the right shows the rate of change in loss ($d_{loss}$) for the first 200 million tokens. Although this rate of change tends to converge to the same value for all tested batch sizes (i.e., with time, all lines converge at the same rate), the initial values differ significantly in the early stages of training, with bigger "natural" batches presenting a higher rate of change. Although we did not explore this extensively, we observed the same behavior for our other model sizes, independent of changes to the learning rate settings or the number of warm-up steps.
  • Figure 5: All logs from our training runs recorded loss, evaluation loss, the current value of the learning rate, and the gradient norm for that specific optimization step. These logs are available in our https://github.com/Nkluge-correa/Tucano/tree/main/logs/README.md repository.
  • ...and 2 more figures
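
The token counts quoted in the Figure 4 caption follow from multiplying the number of sequences per batch by the sequence length. The sketch below reproduces that arithmetic and the split of a global batch across gradient accumulation steps. It assumes a sequence length of 2,048 tokens, which is inferred from the caption's numbers (about 1 million tokens for a batch of 512) and is not stated in this section. The function names are illustrative and do not come from the Tucano codebase.

```python
# Assumption: 2,048 tokens per sequence, inferred from the Figure 4 caption
# (about 1 million tokens / 512 sequences); not stated in this section.
SEQ_LEN = 2048


def tokens_per_batch(num_sequences: int, seq_len: int = SEQ_LEN) -> int:
    """Tokens consumed by one optimizer step for a given global batch size."""
    return num_sequences * seq_len


def micro_batch_size(global_batch: int, accumulation_steps: int) -> int:
    """Sequences per forward/backward pass when the global batch is split
    evenly across gradient accumulation steps (single device assumed)."""
    if global_batch % accumulation_steps != 0:
        raise ValueError("global batch must be divisible by accumulation steps")
    return global_batch // accumulation_steps


if __name__ == "__main__":
    # Batch sizes listed in the caption.
    for batch in (512, 256, 128, 64, 32):
        print(f"batch {batch:>3} -> {tokens_per_batch(batch):>9,} tokens per step")
    # Accumulation levels used to reproduce the 256-sequence batch.
    for steps in (2, 4, 8):
        print(f"256 via {steps} accumulation steps -> "
              f"micro-batch of {micro_batch_size(256, steps)}")
```

Under this assumption, a batch of 512 corresponds to 1,048,576 tokens and a batch of 32 to 65,536 tokens, which matches the "1 million" and "65 thousand" figures in the caption.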