PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data

Diedre Carmo, Marcos Piau, Israel Campiotti, Rodrigo Nogueira, Roberto Lotufo

TL;DR

PTT5 demonstrates that pretraining T5 on a large Brazilian Portuguese corpus (BrWac) yields gains over the original T5 on Portuguese tasks. Introducing a dedicated Portuguese vocabulary further improves performance, particularly for ASSIN 2 and NER, while full-model pretraining outperforms vocabulary-only updates in certain settings. The approach achieves competitive results against state-of-the-art Portuguese models, though it remains slightly behind the best BERT-based systems on some benchmarks. The work highlights the value of language-specific corpora and tokenization in seq2seq pretrained models for non-English languages, and provides reusable resources for the community.

Abstract

In natural language processing (NLP), there is a need for more resources in Portuguese, since much of the data used in state-of-the-art research is in other languages. In this paper, we pretrain a T5 model on the BrWac corpus, an extensive collection of web pages in Portuguese, and evaluate its performance against other Portuguese pretrained models and multilingual models on three different tasks. We show that our Portuguese pretrained models perform significantly better than the original T5 models. Moreover, we demonstrate the positive impact of using a Portuguese vocabulary. Our code and models are available at https://github.com/unicamp-dl/PTT5.

Paper Structure

This paper contains 18 sections, 2 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Pretraining cross-entropy loss for different model sizes and vocabularies. a) training the whole model; b) training vocabulary embeddings only.
  • Figure 2: Training and validation curves comparing the string generation and linear layer approaches on the similarity and entailment tasks, starting from T5 small (a and b) and T5 base weights (c and d).
  • Figure 3: Training and validation loss curves on ASSIN 2 tasks, starting from the pretrained PTT5 weights.