Synergistic Approach for Simultaneous Optimization of Monolingual, Cross-lingual, and Multilingual Information Retrieval

Adel Elmahdy, Sheng-Chieh Lin, Amin Ahmad

TL;DR

This work aims to optimize monolingual, cross-lingual, and multilingual retrieval performance simultaneously rather than trading one off against the others. It introduces a hybrid batch training strategy that mixes monolingual and cross-lingual QA batches so that a dual-encoder, trained with a contrastive InfoNCE loss, learns language-agnostic representations. On XQuAD-R, MLQA-R, and MIRACL, the hybrid approach achieves competitive zero-shot performance and notably reduces language bias compared with baselines, while maintaining strong monolingual and cross-lingual capabilities. The method also generalizes zero-shot to unseen languages, which suggests practical potential for broad linguistic coverage in multilingual IR systems.

Abstract

Information retrieval across different languages is an increasingly important challenge in natural language processing. Recent approaches based on multilingual pre-trained language models have achieved remarkable success, yet they often optimize for either monolingual, cross-lingual, or multilingual retrieval performance at the expense of others. This paper proposes a novel hybrid batch training strategy to simultaneously improve zero-shot retrieval performance across monolingual, cross-lingual, and multilingual settings while mitigating language bias. The approach fine-tunes multilingual language models using a mix of monolingual and cross-lingual question-answer pair batches sampled based on dataset size. Experiments on XQuAD-R, MLQA-R, and MIRACL benchmark datasets show that the proposed method consistently achieves comparable or superior results in zero-shot retrieval across various languages and retrieval tasks compared to monolingual-only or cross-lingual-only training. Hybrid batch training also substantially reduces language bias in multilingual retrieval compared to monolingual training. These results demonstrate the effectiveness of the proposed approach for learning language-agnostic representations that enable strong zero-shot retrieval performance across diverse languages.
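
The contrastive objective mentioned in the TL;DR can be illustrated with a minimal sketch of an InfoNCE loss over one batch of question-answer embeddings. This is not the authors' implementation: the function name info_nce_loss, the use of cosine similarity, the in-batch negatives, and the temperature value of 0.05 are all assumptions chosen for illustration.

```python
import math


def info_nce_loss(query_vecs, passage_vecs, temperature=0.05):
    """Mean InfoNCE loss over a batch, using in-batch negatives.

    query_vecs[i] and passage_vecs[i] form the positive pair; every other
    passage in the batch serves as a negative for query i.
    Assumptions (not from the paper): cosine similarity, temperature 0.05.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    queries = [normalize(q) for q in query_vecs]
    passages = [normalize(p) for p in passage_vecs]

    total = 0.0
    for i, q in enumerate(queries):
        # Temperature-scaled cosine similarity of query i to every passage.
        logits = [sum(a * b for a, b in zip(q, p)) / temperature
                  for p in passages]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(
            sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the positive passage.
        total += log_denominator - logits[i]
    return total / len(queries)
```

The loss is near zero when each question is closest to its own answer and grows when another passage in the batch scores higher. Because the loss only sees embeddings, the same function applies whether a batch is monolingual or cross-lingual.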

Paper Structure

This paper contains 25 sections, 1 equation, 2 figures, 13 tables.

Figures (2)

  • Figure 1: Illustrative example of monolingual, cross-lingual, and multilingual information retrieval.
  • Figure 2: Illustration of the proposed hybrid batch sampling (assuming training data is available only in English, Arabic, and Japanese), where the model is exposed to monolingual and cross-lingual batches with probabilities $\alpha$ and $\beta = 1 - \alpha$, respectively.
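
The sampling scheme described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the authors' code: the function name sample_hybrid_batch, the dictionary layout of the data, and the choice to pick a language (or language pair) with probability proportional to its number of QA pairs are assumptions. The abstract says only that batches are "sampled based on dataset size".

```python
import random


def sample_hybrid_batch(mono_data, cross_data, alpha, batch_size, rng=random):
    """Draw one training batch under hybrid batch sampling.

    mono_data:  dict mapping a language code, e.g. "en", to a list of
                (question, answer) pairs written in that language.
    cross_data: dict mapping a (question_lang, answer_lang) tuple to a list
                of (question, answer) pairs in those two languages.
    alpha:      probability of a monolingual batch; a cross-lingual batch
                is drawn with probability beta = 1 - alpha.
    Assumption (not from the paper): the language or language pair is
    chosen with probability proportional to its number of QA pairs.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    use_mono = rng.random() < alpha
    pool = mono_data if use_mono else cross_data
    kind = "monolingual" if use_mono else "cross-lingual"

    # Size-proportional choice of the language (pair) for this batch.
    keys = list(pool)
    weights = [len(pool[key]) for key in keys]
    key = rng.choices(keys, weights=weights, k=1)[0]

    # Every example in a batch shares the same language (pair).
    examples = pool[key]
    batch = rng.sample(examples, min(batch_size, len(examples)))
    return kind, key, batch
```

Setting alpha to 1 or 0 recovers monolingual-only or cross-lingual-only training, the two baselines the abstract compares against. Passing a seeded random.Random instance as rng makes the batch sequence reproducible.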