Open Sentence Embeddings for Portuguese with the Serafim PT* encoders family
Luís Gomes, António Branco, João Silva, João Rodrigues, Rodrigo Santos
TL;DR
The paper tackles the lack of high-quality Portuguese sentence encoders by introducing Serafim PT*, a family of open-source encoders available in 100M, 335M, and 900M parameter sizes to accommodate diverse compute budgets. It presents a four-stage sequential training pipeline that combines supervised objectives (semantic textual similarity, STS, trained with CoSENT and AnglE) and unsupervised methods (TSDAE/CT with a GISTEmbed guide) to produce strong Portuguese embeddings for semantic similarity and information retrieval (IR) tasks. Empirical results show Serafim PT* achieving state-of-the-art performance on Portuguese STS benchmarks and on IR with MS MARCO, with both model size and language variant affecting performance. The work provides open-source models for both the PTPT and PTBR variants of Portuguese and outlines future directions, such as data augmentation and language-variant-specific training, to further improve results.
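To make the CoSENT objective named in the summary more concrete, the sketch below computes a ranking loss of that kind over a small batch of sentence pairs. It follows the commonly cited formulation, log(1 + Σ exp(λ(s_i − s_j))) summed over pairs where pair j has the higher gold similarity, and is not taken from the paper itself. The function name `cosent_loss` and the scale value of 20 are assumptions made for illustration.

```python
import math

def cosent_loss(cos_scores, labels, scale=20.0):
    """Illustrative CoSENT-style ranking loss (hypothetical helper).

    cos_scores: predicted cosine similarity for each sentence pair.
    labels:     gold similarity for each pair (higher = more similar).
    scale:      temperature-like factor; 20.0 is an assumed value.
    """
    terms = [0.0]  # exp(0) = 1 supplies the "1 +" inside the logarithm
    for s_i, y_i in zip(cos_scores, labels):
        for s_j, y_j in zip(cos_scores, labels):
            # Pair j is labelled more similar than pair i, so the model
            # is penalised whenever s_i is not below s_j.
            if y_i < y_j:
                terms.append(scale * (s_i - s_j))
    # Numerically stable log-sum-exp over all collected terms.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Predictions that agree with the gold ordering give a small loss,
# predictions that invert it give a large one.
good = cosent_loss([0.9, 0.1], [1.0, 0.0])
bad = cosent_loss([0.1, 0.9], [1.0, 0.0])
```

Because the loss depends only on the relative order of the predicted scores, it suits STS data whose gold labels are graded similarity ratings rather than binary matches.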
Abstract
Sentence encoders encode the semantics of their input, enabling key downstream applications such as classification, clustering, or retrieval. In this paper, we present Serafim PT*, a family of open-source sentence encoders for Portuguese in various sizes, suited to different hardware/compute budgets. Each model exhibits state-of-the-art performance and is made openly available under a permissive license, allowing its use for both commercial and research purposes. Besides the sentence encoders, this paper contributes a systematic study and lessons learned concerning the criteria for selecting the learning objectives and parameters that support top-performing encoders.
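As a rough illustration of how sentence embeddings support retrieval, the sketch below ranks candidate vectors by cosine similarity to a query vector. The three-dimensional vectors are toy values standing in for encoder outputs, and the helper names `cosine` and `rank_by_cosine` are hypothetical, not part of any Serafim PT* release.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy embeddings (assumed values, not produced by a real encoder).
docs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query = [0.9, 0.1, 0.0]

order, scores = rank_by_cosine(query, docs)
```

In practice the vectors would come from one of the encoders, and the same similarity scores could feed a clustering or classification step instead of a ranking.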
