MILCO: Learned Sparse Retrieval Across Languages via a Multilingual Connector

Thong Nguyen, Yibin Lei, Jia-Huei Ju, Eugene Yang, Andrew Yates

TL;DR

MILCO is a multilingual learned sparse retriever that maps queries and documents from 39 languages into a shared English lexical space using a lightweight Multilingual Connector. Training has two stages: Sparse Alignment Pretraining (SAP) grounds multilingual text in the English lexicon, and Sparse Contrastive Training (SCT) with distillation then sharpens retrieval performance. A LexEcho head additionally preserves important source-language tokens so that tail entities are not lost when text is projected into English. MILCO reports state-of-the-art multilingual and cross-lingual sparse retrieval on the MIRACL, MTEB, MLDR, and MKQA benchmarks, with strong zero-shot cross-lingual capabilities. Its sparse representations are transparent and interpretable, and they support dynamic efficiency through post-hoc pruning: with document representations pruned to about 30 active dimensions on average, MILCO 560M still outperforms the similarly sized Qwen3-Embed 0.6B with 1024 dimensions. Overall, MILCO unifies multilingual and cross-lingual retrieval within a single, prune-friendly model.

Abstract

Learned Sparse Retrieval (LSR) combines the efficiency of bi-encoders with the transparency of lexical matching, but existing approaches struggle to scale beyond English. We introduce MILCO, an LSR architecture that maps queries and documents from different languages into a shared English lexical space via a multilingual connector. MILCO is trained with a specialized two-stage regime that combines Sparse Alignment Pretraining with contrastive training to provide representation transparency and effectiveness while mitigating semantic collapse. Motivated by the observation that uncommon entities are often lost when projected into English, we propose a new LexEcho head, which enhances robustness by augmenting the English lexical representation with a source-language view obtained through a special [ECHO] token. MILCO achieves state-of-the-art multilingual and cross-lingual LSR performance, outperforming leading dense, sparse, and multi-vector baselines such as BGE-M3 and Qwen3-Embed on standard multilingual benchmarks, while supporting dynamic efficiency through post-hoc pruning. Notably, when using mass-based pruning to reduce document representations to only 30 active dimensions on average, MILCO 560M outperforms the similarly-sized Qwen3-Embed 0.6B with 1024 dimensions.
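
The abstract mentions mass-based pruning but does not define it here. The sketch below illustrates one plausible reading, which is an assumption rather than the authors' procedure: keep the highest-weight terms of a sparse vector until they cover a chosen fraction of its total weight. The function name `prune_by_mass`, the `mass` parameter, and the toy weights are all hypothetical.

```python
def prune_by_mass(weights: dict[str, float], mass: float = 0.9) -> dict[str, float]:
    """Keep the top-weighted terms until they cover `mass` of the total weight.

    Assumption: "mass" means the sum of the (non-negative) term weights.
    """
    if not 0.0 < mass <= 1.0:
        raise ValueError("mass must be in (0, 1]")
    total = sum(weights.values())
    if total <= 0.0:
        return {}
    target = mass * total
    kept: dict[str, float] = {}
    running = 0.0
    # Visit terms from heaviest to lightest and stop once the target is covered.
    for term, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        if running >= target:
            break
        kept[term] = weight
        running += weight
    return kept


# Toy sparse document vector over English vocabulary terms (made-up weights).
doc = {"tokyo": 4.0, "capital": 2.0, "japan": 1.0, "city": 0.5, "of": 0.5}
pruned = prune_by_mass(doc, mass=0.75)
print(sorted(pruned))
```

Under this reading, the number of surviving dimensions varies per document: a vector dominated by a few heavy terms is pruned harder than a flat one, which fits the abstract's phrasing of 30 active dimensions "on average".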

Paper Structure

This paper contains 39 sections, 10 equations, 7 figures, 17 tables.

Figures (7)

  • Figure 1: MILCO's LexEcho head produces two lexical views: (1) an English view supporting cross-lingual and multilingual retrieval, and (2) a source view for robustness to uncommon entities.
  • Figure 2: Sparse representations with different training strategies. Alignment-only training produces many grounded tokens (green) but also distantly relevant tokens (orange); adding contrastive training further prunes and refines them. Contrastive-only training suffers from semantic collapse, drifting toward ungrounded tokens (red).
  • Figure 3: The tail entity Momo is missing in the English view of the query and Doc2, reducing Doc2’s score despite its higher relevance. The LexEcho head resolves this by selectively retaining missing entities from source tokens, correctly ranking Doc2 on top.
  • Figure 4: Model size versus effectiveness on MIRACL. MILCO is lightweight (560M params), while being highly effective.
  • Figure 5: Effectiveness (nDCG@10, MIRACL) of MILCO with varying sparsity levels obtained by post-hoc pruning methods.
  • ...and 2 more figures
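
The two-view idea in Figures 1 and 3 can be illustrated with a toy scoring example. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes the English view and the source view are merged into one sparse vector (source terms are given a hypothetical `src:` prefix to keep the vocabularies apart) and that relevance is a sparse dot product. All weights are invented for illustration.

```python
def dot(query: dict[str, float], doc: dict[str, float]) -> float:
    """Sparse dot product over the terms the two vectors share."""
    return sum(w * doc[t] for t, w in query.items() if t in doc)


def combine(english: dict[str, float], source: dict[str, float]) -> dict[str, float]:
    """Merge the English view with a namespaced source view (assumed scheme)."""
    merged = dict(english)
    for term, weight in source.items():
        merged["src:" + term] = weight
    return merged


# Toy query about the tail entity "momo"; the English view misses the entity.
query_en = {"dumpling": 1.0, "recipe": 1.0}
query_src = {"momo": 2.0}

# Doc1 is generically on-topic; Doc2 actually mentions the entity.
doc1_en, doc1_src = {"dumpling": 1.5, "recipe": 1.0}, {}
doc2_en, doc2_src = {"dumpling": 1.0, "recipe": 1.0}, {"momo": 2.0}

# English view only: Doc1 outranks the more relevant Doc2.
english_only = (dot(query_en, doc1_en), dot(query_en, doc2_en))

# With the source view added: Doc2 moves to the top.
query = combine(query_en, query_src)
with_echo = (dot(query, combine(doc1_en, doc1_src)),
             dot(query, combine(doc2_en, doc2_src)))

print(english_only, with_echo)
```

The point of the example is only the ranking flip described in the Figure 3 caption: a retained source-language token gives the entity-bearing document a match that the English view alone cannot provide.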