Fine-tuning the SwissBERT Encoder Model for Embedding Sentences and Documents
Juri Grosjean, Jannis Vamvas
TL;DR
The study addresses the need for high-quality multilingual sentence and document embeddings in a Swiss context by fine-tuning SwissBERT with a SimCSE-style contrastive objective on ~1.5M Swiss news articles. Each article's title and lead form one input and its body the other, so the two segments act as a positive pair; embeddings are computed with mean pooling and SwissBERT's language adapters. The resulting SentenceSwissBERT outperforms both the base SwissBERT and a strong multilingual Sentence-BERT baseline on monolingual and cross-lingual document retrieval and on text classification, with Romansh showing the largest gains, up to 55 percentage points. The results suggest that modular language adapters combined with contrastive fine-tuning improve cross-lingual embedding quality for semantic search, and the model is openly available for research use. Limitations include training only on news-domain data and a 512-token input limit; proposed future work extends the approach to other domains and data sources.
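To make the training recipe summarized above more concrete, the following is a minimal sketch of mean pooling and a SimCSE-style contrastive loss with in-batch negatives. It is not the authors' implementation: the function names, the plain-list vector representation, and the temperature of 0.05 are assumptions chosen for illustration, and a real setup would compute the token vectors with the encoder and optimize the loss with an autograd framework.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [value / max(count, 1) for value in total]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def contrastive_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style loss: anchor i (e.g. title+lead) should be closest to
    positive i (e.g. body); all other positives in the batch act as negatives.
    The temperature value is an assumption, not taken from the paper."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

In this sketch the loss is low when each title+lead embedding is most similar to the embedding of its own article body and high when it is closer to another article's body, which is the behavior the contrastive objective rewards.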
Abstract
Encoder models trained for the embedding of sentences or short documents have proven useful for tasks such as semantic search and topic modeling. In this paper, we present a version of the SwissBERT encoder model that we specifically fine-tuned for this purpose. SwissBERT contains language adapters for the four national languages of Switzerland -- German, French, Italian, and Romansh -- and has been pre-trained on a large number of news articles in those languages. Using contrastive learning based on a subset of these articles, we trained a fine-tuned version, which we call SentenceSwissBERT. Multilingual experiments on document retrieval and text classification in a Switzerland-specific setting show that SentenceSwissBERT surpasses the accuracy of the original SwissBERT model and of a comparable baseline. The model is openly available for research use.
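As a rough illustration of how document retrieval with sentence embeddings can be scored, the sketch below ranks candidate documents by cosine similarity to a query embedding and reports top-1 accuracy. The evaluation protocol used in the paper is not described in this section, so the pairing convention (query i should retrieve candidate i), the function names, and the toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, candidates):
    """Return the index of the candidate most similar to the query."""
    scores = [cosine(query, candidate) for candidate in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def retrieval_accuracy(queries, candidates):
    """Fraction of queries whose nearest candidate has the same index
    (assumed convention: query i and candidate i are a matching pair)."""
    hits = sum(
        1 for i, query in enumerate(queries) if retrieve(query, candidates) == i
    )
    return hits / len(queries)


# Toy embeddings standing in for encoder outputs, e.g. the same articles
# embedded in two different languages for a cross-lingual setting.
queries = [[0.9, 0.1], [0.2, 0.8]]
candidates = [[1.0, 0.0], [0.0, 1.0]]
accuracy = retrieval_accuracy(queries, candidates)
```

With real data, the query and candidate vectors would come from embedding the documents with the model under comparison, so the same function can score the base model, the fine-tuned model, and a baseline on equal terms.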
