PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data
Diedre Carmo, Marcos Piau, Israel Campiotti, Rodrigo Nogueira, Roberto Lotufo
TL;DR
PTT5 demonstrates that pretraining T5 on a large Brazilian Portuguese corpus (BrWac) yields gains over the original T5 checkpoints on Portuguese tasks. Introducing a dedicated Portuguese vocabulary further improves performance, particularly for ASSIN 2 and NER, while full-model pretraining outperforms vocabulary-only updates in certain settings. The approach achieves competitive results against state-of-the-art Portuguese models, though it remains slightly behind the best BERT-based systems on some benchmarks. The work highlights the value of language-specific corpora and tokenization in seq2seq pretrained models for non-English languages, and provides reusable resources for the community.
Abstract
In natural language processing (NLP), there is a need for more resources in Portuguese, since much of the data used in state-of-the-art research is in other languages. In this paper, we pretrain a T5 model on the BrWac corpus, an extensive collection of web pages in Portuguese, and evaluate its performance against other Portuguese pretrained models and multilingual models on three different tasks. We show that our Portuguese pretrained models perform significantly better than the original T5 models. Moreover, we demonstrate the positive impact of using a Portuguese vocabulary. Our code and models are available at https://github.com/unicamp-dl/PTT5.
