xVLM2Vec: Adapting LVLM-based embedding models to multilinguality using Self-Knowledge Distillation
Elio Musacchio, Lucia Siciliani, Pierpaolo Basile, Giovanni Semeraro
TL;DR
The paper addresses the English-centric bias of LVLM-based embedding models by introducing xVLM2Vec, a Self-Knowledge Distillation approach that derives a multilingual, multimodal embedding model from an English-only LVLM baseline. A frozen teacher guides a trainable student on parallel English↔non-English data through a two-part loss: one term aligns the student's non-English embeddings to the teacher's English embedding space, and the other preserves the student's English representations. Training uses LoRA and FSDP on a large parallel corpus. To evaluate multilingual multimodal performance, the authors propose MMMEB, the first benchmark for such models, which aggregates multiple datasets and tasks across five languages under a common evaluation protocol. Results show that xVLM2Vec improves non-English embedding quality while maintaining English performance, though CLIP/SigLIP baselines remain strong overall. The work also releases data, models, the benchmark, and code to support research on multilingual vision-language embeddings. The findings support the practicality of multilingual adaptation for LVLM-based embeddings and provide a reproducible framework for future benchmarking and development.
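The two-part objective summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary does not state the exact form of the loss, so the sketch assumes mean squared error for both terms and equal weighting, and the names `mse`, `self_distillation_loss`, `teacher_en`, `student_en`, and `student_xx` are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def self_distillation_loss(
    teacher_en: Vector,  # frozen teacher embedding of the English input
    student_en: Vector,  # student embedding of the same English input
    student_xx: Vector,  # student embedding of the parallel non-English input
) -> float:
    """Assumed two-part loss: alignment term plus preservation term."""
    # Term 1: pull the student's non-English embedding toward
    # the teacher's English embedding.
    align = mse(teacher_en, student_xx)
    # Term 2: keep the student's English embedding close to
    # the teacher's, so English quality is not lost.
    preserve = mse(teacher_en, student_en)
    # Equal weighting of the two terms is an assumption of this sketch.
    return align + preserve
```

In an actual training loop the three vectors would come from forward passes of the teacher and student over one parallel example, and only the student's parameters would receive gradients.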
Abstract
In the current literature, most embedding models rely on the encoder-only transformer architecture to extract a dense and meaningful representation of a given input, which may be text, an image, or another modality. With recent advances in language modeling driven by Large Language Models, the possibility of extracting embeddings from these large and extensively trained models has been explored. However, current studies focus on textual embeddings in English, which is also the main language on which these models have been trained. Furthermore, very few models handle multimodal and multilingual input. In light of this, we propose an adaptation methodology for Large Vision-Language Models trained on English-language data that improves their performance in extracting multilingual and multimodal embeddings. Finally, we design and introduce a benchmark to evaluate the effectiveness of multilingual and multimodal embedding models.
