The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design
Artem Snegirev, Maria Tikhonova, Anna Maksimova, Alena Fenogenova, Alexander Abramov
TL;DR
The paper tackles the scarcity of Russian-focused embeddings by introducing ru-en-RoSBERTa and ruMTEB, a Russian extension of the Massive Text Embedding Benchmark. It details a 23-dataset, 7-category benchmark with 17 new Russian tasks, and demonstrates a contrastive fine-tuning pipeline that leverages cross-lingual data for improved embeddings. The proposed approach shows competitive performance against state-of-the-art Russian models and emphasizes the benefits of cross-lingual transfer and synthetic data, releasing open-source code and a public leaderboard to accelerate Russian NLP research. Together, ru-en-RoSBERTa and ruMTEB provide a practical, extensible framework for evaluating and improving Russian text embeddings in a multilingual setting.
Abstract
Embedding models play a crucial role in Natural Language Processing (NLP) by creating text embeddings used in various tasks such as information retrieval and assessing semantic text similarity. This paper focuses on research related to embedding models in the Russian language. It introduces a new Russian-focused embedding model called ru-en-RoSBERTa and the ruMTEB benchmark, a Russian extension of the Massive Text Embedding Benchmark (MTEB). Our benchmark includes seven categories of tasks, such as semantic textual similarity, text classification, reranking, and retrieval. The research also assesses a representative set of Russian and multilingual models on the proposed benchmark. The findings indicate that the new model achieves results that are on par with state-of-the-art models in Russian. We release the model ru-en-RoSBERTa, and the ruMTEB framework comes with open-source code, integration into the original framework, and a public leaderboard.
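As an illustration of how text embeddings serve tasks such as semantic textual similarity and retrieval, the following minimal sketch scores candidate texts against a query by cosine similarity and ranks them. The vectors are toy values chosen for illustration, not outputs of ru-en-RoSBERTa, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical; this is not the paper's evaluation code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings" (assumed values, for illustration only).
query = [1.0, 0.0, 1.0]
docs = [
    [0.9, 0.1, 0.8],    # points in nearly the same direction as the query
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [-1.0, 0.0, -1.0],  # points in the opposite direction
]

ranking = rank_by_similarity(query, docs)
for index, score in ranking:
    print(f"doc {index}: {score:.3f}")
```

In practice the vectors would come from an embedding model, and a benchmark would compare such scores against human similarity judgments or relevance labels.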
