Advancing Neural Encoding of Portuguese with Transformer Albertina PT-*
João Rodrigues, Luís Gomes, João Silva, António Branco, Rodrigo Santos, Henrique Lopes Cardoso, Tomás Osório
TL;DR
This work introduces Albertina PT-*, a Transformer-based encoder family for Portuguese built from DeBERTa V2 XLarge, with distinct PT-PT and PT-BR variants. The PT-BR variant is pre-trained on brWaC, and the PT-PT variant on data from OSCAR, DCEP, Europarl, and ParlamentoPT. Both are evaluated on ASSIN 2 and translated GLUE benchmarks (PLUE), achieving new state-of-the-art results for PT-BR and competitive performance for PT-PT. The models are publicly released and demonstrate improved efficiency relative to prior Portuguese models like BERTimbau, while highlighting the value of language-variant specialization. The work also discusses cross-variant transfer, evaluation caveats (offline GLUE versus online benchmarks), and directions for future improvements, including larger-scale training and higher-quality corpora.
Abstract
To advance the neural encoding of Portuguese (PT), and a fortiori the technological preparation of this language for the digital age, we developed a Transformer-based foundation model that sets a new state of the art in this respect for two of its variants, namely European Portuguese from Portugal (PT-PT) and American Portuguese from Brazil (PT-BR). To develop this encoder, which we named Albertina PT-*, we used a strong model, DeBERTa, as a starting point and pre-trained it on Portuguese data sets, namely a data set we gathered for PT-PT and the brWaC corpus for PT-BR. The performance of Albertina and competing models was assessed by evaluating them on prominent downstream language processing tasks adapted for Portuguese. Both the Albertina PT-PT and PT-BR versions are distributed free of charge under the most permissive license possible and can be run on consumer-grade hardware, thus seeking to contribute to the advancement of research and innovation in language technology for Portuguese.
