ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval

Antoine Louis, Vageesh Saxena, Gijs van Dijck, Gerasimos Spanakis

TL;DR

ColBERT-XM introduces a modular, multi-vector dense retrieval framework that is trained on a single high-resource language and transfers zero-shot to many others via language-specific adapters within the XMOD backbone. It employs a bi-encoder with MaxSim-based late interaction and a contrastive learning objective, complemented by centroid-based indexing for scalable inference. Experiments show competitive performance across languages with far less multilingual training data, strong zero-shot generalization, and a markedly reduced environmental footprint compared with existing multilingual retrievers. By enabling effective retrieval across diverse languages without language-specific labeled data, ColBERT-XM advances inclusive and sustainable information access. Code and model checkpoints are publicly released.

Abstract

State-of-the-art neural retrievers predominantly focus on high-resource languages like English, which impedes their adoption in retrieval scenarios involving other languages. Current approaches circumvent the lack of high-quality labeled data in non-English languages by leveraging multilingual pretrained language models capable of cross-lingual transfer. However, these models require substantial task-specific fine-tuning across multiple languages, often perform poorly in languages with minimal representation in the pretraining corpus, and struggle to incorporate new languages after the pretraining phase. In this work, we present a novel modular dense retrieval model that learns from the rich data of a single high-resource language and effectively zero-shot transfers to a wide array of languages, thereby eliminating the need for language-specific labeled data. Our model, ColBERT-XM, demonstrates competitive performance against existing state-of-the-art multilingual retrievers trained on more extensive datasets in various languages. Further analysis reveals that our modular approach is highly data-efficient, effectively adapts to out-of-distribution data, and significantly reduces energy consumption and carbon emissions. By demonstrating its proficiency in zero-shot scenarios, ColBERT-XM marks a shift towards more sustainable and inclusive retrieval systems, enabling effective information accessibility in numerous languages. We publicly release our code and models for the community.

Paper Structure (32 sections, 10 equations, 4 figures, 4 tables)


Figures (4)

  • Figure 1: An illustration of ColBERT-XM's modular architecture during its successive learning stages. Components that are blurred indicate they remain frozen throughout the learning phase. (a) First, the model learns language-specific modular adapters at each transformer layer through MLM pretraining on a large multilingual corpus. (b) Next, the model is adapted to the downstream task by fine-tuning its shared weights on the source language while keeping the modular adapters and the embedding layer frozen. (c) The model is then used in a zero-shot fashion by routing the target language's input text through the corresponding modular units. (d) Finally, extra languages can be added post-hoc by learning new modular components only through lightweight MLM training on the new language.
  • Figure 2: Illustration of the multi-vector late interaction paradigm used in our proposed ColBERT-XM model.
  • Figure 3: Performance of ColBERT-XM on mMARCO small dev set, based on the volume of training examples.
  • Figure 4: MRR@10 results of our multi-vector representation retriever (ColBERT-XM) compared to its single-vector counterpart (DPR-XM) on mMARCO dev set.
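The MaxSim-based late interaction mentioned in the TL;DR and illustrated in Figure 2 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the ColBERT-style formulation in which each query token embedding is matched to its most similar document token embedding (cosine similarity here) and the per-token maxima are summed. The function names and the toy 2-dimensional embeddings are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction relevance score (assumed ColBERT-style MaxSim).

    For every query token embedding, take the maximum cosine similarity
    against all document token embeddings, then sum those maxima.
    """
    q = [l2_normalize(v) for v in query_vecs]
    d = [l2_normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


# Toy token embeddings with made-up values, one inner list per token.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # covers both query tokens exactly
doc_b = [[1.0, 1.0]]                          # only partially matches each token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because each query token is scored independently against the document tokens, a document that matches every query token well (`doc_a`) outranks one that matches each only partially (`doc_b`). A single-vector retriever such as the DPR-XM baseline in Figure 4 would instead compare one pooled vector per text.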