xVLM2Vec: Adapting LVLM-based embedding models to multilinguality using Self-Knowledge Distillation

Elio Musacchio, Lucia Siciliani, Pierpaolo Basile, Giovanni Semeraro

TL;DR

The paper addresses the limitation of English-centric LVLM-based embeddings by introducing xVLM2Vec, a Self-Knowledge Distillation approach that derives a multilingual, multimodal embedding model from an English-only LVLM baseline. A frozen teacher guides a trainable student on parallel English↔non-English data: a two-part loss aligns the student's non-English embeddings to the teacher's English space while preserving the student's English representations. Training uses LoRA and FSDP on a large parallel corpus. To evaluate multilingual multimodal performance, the authors propose MMMEB, which they present as the first benchmark for such models; it aggregates multiple datasets and tasks across five languages under a carefully constructed evaluation protocol. Results show that xVLM2Vec improves non-English embedding quality while maintaining English performance, though CLIP/SigLIP baselines remain strong overall. The work also releases data, models, the benchmark, and code to support further research in multilingual vision-language embedding. The findings support the practicality of multilingual adaptation for LVLM-based embeddings and establish a reproducible framework for future benchmarking and development.
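The two-part loss described above can be illustrated with a small sketch. This summary does not state the distance function, so the code below assumes mean squared error between embeddings and an unweighted sum of the two terms; the function names `mse` and `self_distillation_loss` are hypothetical and not taken from the authors' code.

```python
from typing import List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def self_distillation_loss(
    teacher_en: Vector,  # frozen teacher embedding of the English text
    student_en: Vector,  # student embedding of the same English text
    student_x: Vector,   # student embedding of the parallel non-English text
) -> float:
    # Term 1 (assumed MSE): keep the student's English embedding
    # close to the teacher's, preserving English behaviour.
    preserve = mse(teacher_en, student_en)
    # Term 2 (assumed MSE): pull the student's non-English embedding
    # towards the teacher's English embedding.
    align = mse(teacher_en, student_x)
    # Assumption: the two terms are summed with equal weight.
    return preserve + align


if __name__ == "__main__":
    loss = self_distillation_loss([1.0, 0.0], [1.0, 0.0], [0.0, 0.0])
    print(loss)
```

In this toy call the student already matches the teacher on the English input, so only the alignment term contributes to the loss.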

Abstract

In the current literature, most embedding models are based on the encoder-only transformer architecture to extract a dense and meaningful representation of the given input, which can be a text, an image, and more. With the recent advances in language modeling thanks to the introduction of Large Language Models, the possibility of extracting embeddings from these large and extensively trained models has been explored. However, current studies focus on textual embeddings in English, which is also the main language on which these models have been trained. Furthermore, there are very few models that consider multimodal and multilingual input. In light of this, we propose an adaptation methodology for Large Vision-Language Models trained on English language data to improve their performance in extracting multilingual and multimodal embeddings. Finally, we design and introduce a benchmark to evaluate the effectiveness of multilingual and multimodal embedding models.

Paper Structure

This paper contains 8 sections, 5 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of the proposed Self-Knowledge Distillation approach. $x$ refers to one of the non-English languages in our training mixture (French, German, Italian and Spanish). Training is done on a parallel dataset of pairs of English and language-$x$ texts. The teacher model is kept frozen while the student model's weights are updated.
  • Figure 2: Comparison of two possible translation approaches considering a translation from English to Italian. In the example above, the translation methodology described in \ref{sec:translation} is applied, while a direct translation methodology is applied in the example below. The term "right", which in this case refers to the concept of "direction" (e.g. "The car is on the right side"), is correctly translated in the first example, while it is not correctly translated in the second, where instead the translation model translates "right" with the concept of "correctness" (e.g. "Yes, that's right!"). These examples are obtained from the translation model MADLAD-400 3B. The example is from an actual instance in the training dataset from the A-OKVQA task.
  • Figure 3: Results for each language obtained by averaging the P@1 on the tasks for VLM2Vec and xVLM2Vec when using the plain formatting.
  • Figure 4: Results for each language obtained by averaging the P@1 on the tasks for VLM2Vec and xVLM2Vec when using punctuation in the formatting.
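
Figures 3 and 4 report P@1 averaged over tasks for each language. As a rough illustration of that metric, the sketch below assumes that candidates are ranked by cosine similarity to the query embedding and that each query has exactly one correct candidate; the evaluation protocol itself is not detailed in this summary, and the helper names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def precision_at_1(
    queries: List[Vector],
    candidates: List[List[Vector]],  # one candidate pool per query
    targets: List[int],              # index of the correct candidate
) -> float:
    """Fraction of queries whose top-ranked candidate is the target."""
    hits = 0
    for query, pool, target in zip(queries, candidates, targets):
        scores = [cosine(query, cand) for cand in pool]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == target)
    return hits / len(queries)


def average_over_tasks(task_scores: Dict[str, float]) -> float:
    """Unweighted mean of per-task P@1 scores for one language."""
    return sum(task_scores.values()) / len(task_scores)


if __name__ == "__main__":
    pool = [[1.0, 0.0], [0.0, 1.0]]
    p1 = precision_at_1([[1.0, 0.0], [0.0, 1.0]], [pool, pool], [0, 0])
    print(p1, average_over_tasks({"task_a": p1, "task_b": 1.0}))
```

The task names in the usage lines are placeholders, not tasks from the benchmark.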