Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT* Family
Rodrigo Santos, João Rodrigues, Luís Gomes, João Silva, António Branco, Henrique Lopes Cardoso, Tomás Freitas Osório, Bernardo Leite
TL;DR
This work addresses the scarcity of openly licensed, Portuguese-specific encoders by introducing Albertina 100M PT and Albertina 1.5B PT, two open encoders covering both European Portuguese (PTPT) and Brazilian Portuguese (PTBR). The authors continue pre-training DeBERTa base models on language-specific corpora (OSCAR, CulturaX, DCEP, Europarl, ParlamentoPT) and evaluate them on Portuguese translations of GLUE/SuperGLUE tasks and on the ASSIN 2 benchmark, with extensive hyper-parameter searches during fine-tuning. The larger 1.5B model achieves state-of-the-art performance among open encoders for Portuguese across many tasks, while the 100M model provides a compact, efficient alternative with competitive results. The models and datasets are openly distributed via HuggingFace PORTULAN for research and commercial use, expanding the Portuguese NLP resources available for further development.
Abstract
To foster the neural encoding of Portuguese, this paper contributes foundation encoder models that expand the still very scarce ecosystem of large language models specifically developed for this language that are fully open, in the sense that they are open source and openly distributed for free under an open license for any purpose, including research and commercial use. Like most languages other than English, Portuguese is low-resourced in terms of these foundational language resources, with only the inaugural 900 million parameter Albertina and the 335 million parameter Bertimbau available. Taking these two models as an inaugural set, we extend the ecosystem of state-of-the-art open encoders for Portuguese with a larger, top performance-driven model with 1.5 billion parameters, and a smaller, efficiency-driven model with 100 million parameters. In pursuing this primary goal, we also obtained further results relevant to this ecosystem, namely new datasets for Portuguese based on the SuperGLUE benchmark, which we also distribute openly.
