Advancing Neural Encoding of Portuguese with Transformer Albertina PT-*

João Rodrigues; Luís Gomes; João Silva; António Branco; Rodrigo Santos; Henrique Lopes Cardoso; Tomás Osório

TL;DR

This work introduces Albertina PT-*, a family of Transformer-based encoders for Portuguese built from DeBERTa V2 XLarge, with distinct PT-PT and PT-BR variants. The PT-BR variant is pre-trained on brWaC, and the PT-PT variant on data from OSCAR, DCEP, Europarl, and ParlamentoPT. Both are evaluated on ASSIN 2 and translated GLUE benchmarks (PLUE), achieving new state-of-the-art results for PT-BR and competitive performance for PT-PT. The models are publicly released and demonstrate improved efficiency relative to prior Portuguese models like BERTimbau, while highlighting the value of language-variant specialization. The work also discusses cross-variant transfer, evaluation caveats (offline GLUE vs online benchmarks), and directions for future improvements, including larger-scale training and higher-quality corpora.

Abstract

To advance the neural encoding of Portuguese (PT), and a fortiori the technological preparation of this language for the digital age, we developed a Transformer-based foundation model that sets a new state of the art in this respect for two of its variants, namely European Portuguese from Portugal (PT-PT) and American Portuguese from Brazil (PT-BR). To develop this encoder, which we named Albertina PT-*, we used a strong model, DeBERTa, as a starting point and pre-trained it over Portuguese data sets, namely data sets we gathered for PT-PT and PT-BR, and the brWaC corpus for PT-BR. The performance of Albertina and competing models was assessed by evaluating them on prominent downstream language processing tasks adapted for Portuguese. Both the Albertina PT-PT and PT-BR versions are distributed free of charge under the most permissive license possible and can be run on consumer-grade hardware, thus seeking to contribute to the advancement of research and innovation in language technology for Portuguese.

Paper Structure

This paper contains 18 sections, 2 figures, and 3 tables.

Introduction
Related Work
Encoders whose multilingual data set included Portuguese
Encoders specifically concerned with Portuguese
Data sets
Data sets for the pre-training stage
Data sets for the fine-tuning concerning downstream tasks
Albertina PT-* model
The starting encoder
Pre-training Albertina PT-BR
Pre-training Albertina PT-PT
Pre-training Albertina base models
Fine-tuning Albertina and BERTimbau
Experimental Results
Improving the state of the art on ASSIN 2 tasks
...and 3 more sections

Figures (2)

Figure 1: Training loss for Albertina PT-BR, smoothed with an exponential moving average (smoothing factor 0.95).
Figure 2: Training loss for Albertina PT-PT, smoothed with an exponential moving average (smoothing factor 0.95).
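
Both captions mention exponential-moving-average smoothing with a factor of 0.95. As a rough illustration of what such smoothing does to a loss curve, the sketch below assumes the common convention in which the factor weights the previous smoothed value and the remainder weights the new observation. The function name `ema_smooth` and the sample loss values are hypothetical, and the exact smoothing procedure used by the plotting tool behind the figures (for example, whether a bias correction is applied) is not stated here.

```python
def ema_smooth(values, factor=0.95):
    """Smooth a sequence with an exponential moving average.

    Assumed convention: `factor` weights the running average and
    (1 - factor) weights the newest value. The first value is used
    to initialise the running average.
    """
    if not 0.0 <= factor < 1.0:
        raise ValueError("factor must be in [0, 1)")
    smoothed = []
    running = None
    for v in values:
        if running is None:
            running = v  # initialise with the first observation
        else:
            running = factor * running + (1.0 - factor) * v
        smoothed.append(running)
    return smoothed


# Hypothetical loss values, for illustration only (not taken from the paper).
losses = [4.0, 2.0, 3.0, 1.0]
print(ema_smooth(losses, factor=0.5))  # [4.0, 3.0, 3.0, 2.0]
```

A factor close to 1, such as 0.95, makes the curve follow the long-run trend and suppresses step-to-step noise, at the cost of lagging behind sudden changes in the raw loss.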
