Synergistic Approach for Simultaneous Optimization of Monolingual, Cross-lingual, and Multilingual Information Retrieval
Adel Elmahdy, Sheng-Chieh Lin, Amin Ahmad
TL;DR
This work aims to optimize monolingual, cross-lingual, and multilingual retrieval performance simultaneously. It introduces a hybrid batch training strategy that blends monolingual and cross-lingual QA batches to learn language-agnostic representations, using a dual-encoder trained with contrastive learning and the InfoNCE loss. Across XQuAD-R, MLQA-R, and MIRACL, the hybrid approach achieves competitive zero-shot performance and notably reduces language bias compared with baselines, while maintaining strong monolingual and cross-lingual capabilities. The method also shows strong zero-shot generalization to unseen languages, which points to its practical potential for broad linguistic coverage in multilingual IR systems.
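For readers unfamiliar with the loss named above, here is a minimal sketch of InfoNCE for a dual-encoder, written in plain Python so it runs without dependencies. It assumes in-batch negatives (every other answer in the batch acts as a negative for a given question), dot-product similarity, and a temperature of 0.05; these choices and the function name `info_nce_loss` are illustrative assumptions, not details stated in the summary.

```python
import math

def info_nce_loss(query_vecs, answer_vecs, temperature=0.05):
    """Mean InfoNCE loss over a batch (illustrative sketch).

    query_vecs[i] is paired with answer_vecs[i]; all other answers in the
    batch are treated as negatives (an assumption, not from the paper).
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    losses = []
    for i, q in enumerate(query_vecs):
        # Similarity of this question to every answer in the batch.
        logits = [dot(q, a) / temperature for a in answer_vecs]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the correct (i-th) answer.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each question embedding is closest to its own answer and large when it is closer to another answer in the batch, which is what pushes paired questions and answers together regardless of the language each is written in.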
Abstract
Information retrieval across different languages is an increasingly important challenge in natural language processing. Recent approaches based on multilingual pre-trained language models have achieved remarkable success, yet they often optimize for only one of monolingual, cross-lingual, or multilingual retrieval performance at the expense of the others. This paper proposes a novel hybrid batch training strategy to simultaneously improve zero-shot retrieval performance across monolingual, cross-lingual, and multilingual settings while mitigating language bias. The approach fine-tunes multilingual language models using a mix of monolingual and cross-lingual question-answer pair batches sampled based on dataset size. Experiments on the XQuAD-R, MLQA-R, and MIRACL benchmark datasets show that, compared to monolingual-only or cross-lingual-only training, the proposed method consistently achieves comparable or superior zero-shot retrieval results across various languages and retrieval tasks. Hybrid batch training also substantially reduces language bias in multilingual retrieval compared to monolingual training. These results demonstrate the effectiveness of the proposed approach for learning language-agnostic representations that enable strong zero-shot retrieval performance across diverse languages.
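The hybrid batch idea described above can be sketched roughly as follows. This is one possible reading, not the authors' implementation: each training batch is either monolingual or cross-lingual, chosen with a mixing probability `mono_prob` (a hypothetical parameter), and within the chosen type a dataset is picked with probability proportional to its size. The function name, the dictionary layout, and the proportional weighting are all assumptions made for illustration.

```python
import random

def sample_hybrid_batch(mono_sets, cross_sets, batch_size,
                        mono_prob=0.5, rng=random):
    """Draw one batch of (question, answer) pairs (illustrative sketch).

    mono_sets / cross_sets: dict mapping a dataset key (e.g. "en-en" or
    "en-de", hypothetical labels) to a non-empty list of QA pairs.
    mono_prob: assumed probability of drawing a monolingual batch.
    """
    # Decide the batch type: monolingual or cross-lingual.
    pool = mono_sets if rng.random() < mono_prob else cross_sets
    # Pick one dataset within that type, weighted by its number of pairs.
    keys = list(pool)
    weights = [len(pool[k]) for k in keys]
    key = rng.choices(keys, weights=weights, k=1)[0]
    # Sample the batch without replacement from the chosen dataset.
    pairs = pool[key]
    return key, rng.sample(pairs, min(batch_size, len(pairs)))
```

Because every batch comes from a single dataset, the in-batch negatives share the same question and answer languages; alternating between the two batch types is what exposes the model to both same-language and different-language matching during fine-tuning.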
