MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation

Daniel Tamayo; Iñaki Lacunza; Paula Rivera-Hidalgo; Severino Da Dalt; Javier Aula-Blasco; Aitor Gonzalez-Agirre; Marta Villegas

MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation

Daniel Tamayo, Iñaki Lacunza, Paula Rivera-Hidalgo, Severino Da Dalt, Javier Aula-Blasco, Aitor Gonzalez-Agirre, Marta Villegas

TL;DR

MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code, is introduced, demonstrating that modern encoder architectures can be optimized for both localized linguistic excellence and efficient, high-stakes domain specialization.

Abstract

We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code. Through targeted adaptation, this model family achieves state-of-the-art results on Catalan- and Spanish-specific tasks, while establishing robust performance across specialized biomedical and legal domains. To bridge the gap between research and production, we incorporate Matryoshka Representation Learning (MRL), enabling flexible vector sizing that significantly reduces inference and storage costs. Ultimately, the MrBERT family demonstrates that modern encoder architectures can be optimized for both localized linguistic excellence and efficient, high-stakes domain specialization. We open source the complete model family on Huggingface.

MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation

TL;DR

Abstract

Paper Structure (40 sections, 4 figures, 13 tables)

This paper contains 40 sections, 4 figures, 13 tables.

Introduction
Related Work
The 'Extraction vs. Pre-training' Debate
Adaptation and Multilinguality
Flexible Representation Learning
Pre-training and Adaptation
Data
Pre-Training
Language Adaptation
Domain Adaptation
Classification of targeted domain instances
Domain Mapping and Selection Logic
Pre-Training Settings
Language and Domain Specialization
Vocabulary Adaptation and Initialization.
...and 25 more sections

Figures (4)

Figure 1: Token distribution per language for the Pre-Training phase. The table is shown in logarithmic format for visualization purposes.
Figure 2: MrBERT performance across XTREME, CLUB, and EvalES benchmarks comparing AttMAT (attention head pruning), MAT (MLP hidden size reduction), and standard models (100%). Only average scores shown.
Figure 3: Inference throughput of matryoshka variants (sequence length: 8,192 tokens).
Figure 4: Performance comparison of matryoshka models at different compression levels (25%, 50%, 75%, 100% of the attention heads) against MrBERT models without matryoshka training across four benchmark tasks. Bars represent the average performance difference to the previous higher benchmark value.

MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation

TL;DR

Abstract

MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation

Authors

TL;DR

Abstract

Table of Contents

Figures (4)