M3DR: Towards Universal Multilingual Multimodal Document Retrieval

Adithya S Kolavi, Vyoman Jain

TL;DR

M3DR tackles the gap in multilingual vision-based document retrieval by introducing a scalable, multilingual framework that learns cross-lingual visual-text representations. It employs synthetic data generation and a bilingual-agnostic benchmark (Nayana-IR) to train and evaluate both a single dense vector model (NetraEmbed) and a ColBERT-style multi-vector model (ColNetraEmbed). The results demonstrate state-of-the-art cross-lingual and strong monolingual performance across 22 languages, with Matryoshka embeddings offering efficient deployment and a clear efficiency-accuracy trade-off. The work provides practical resources and insights for deploying multilingual document retrieval systems at scale, while outlining limitations and directions for future expansion to more languages and document regions.
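The efficiency-accuracy trade-off of Matryoshka embeddings can be illustrated with a minimal sketch. It assumes the usual Matryoshka convention that a shorter embedding is obtained by keeping the leading coordinates of the full vector and re-normalizing. The helper names and the toy 4-dimensional vectors are hypothetical and are not taken from the paper.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and rescale the prefix to unit length."""
    prefix = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0 else prefix

def cosine_of_unit_vectors(u, v):
    """Dot product, which equals cosine similarity for unit-length inputs."""
    return sum(a * b for a, b in zip(u, v))

def rank_documents(query_vec, doc_vecs, dim):
    """Rank document indices by similarity using only the first `dim` coordinates."""
    q = truncate_embedding(query_vec, dim)
    scored = [
        (cosine_of_unit_vectors(q, truncate_embedding(d, dim)), i)
        for i, d in enumerate(doc_vecs)
    ]
    return [i for _, i in sorted(scored, reverse=True)]

# Toy data (hypothetical): document 0 is the relevant one for this query.
query = [0.9, 0.1, 0.3, 0.2]
docs = [[0.8, 0.2, 0.4, 0.1], [0.1, 0.9, 0.2, 0.3]]

full_ranking = rank_documents(query, docs, dim=4)
short_ranking = rank_documents(query, docs, dim=2)
```

Halving the number of stored coordinates halves index size and dot-product cost. Whether the ranking is preserved at a given size depends on how the model was trained, which is the trade-off the summary refers to.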

Abstract

Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages, enabling applicability across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that captures real-world multilingual scenarios, evaluating models under monolingual, multilingual, and mixed-language settings. M3DR generalizes across both single dense vector and ColBERT-style token-level multi-vector retrieval paradigms. Our models, NetraEmbed and ColNetraEmbed, achieve state-of-the-art performance with ~150% relative improvements on cross-lingual retrieval.
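The two retrieval paradigms named in the abstract differ mainly in how a query is scored against a document. The sketch below contrasts them under common assumptions: cosine similarity for the single dense vector case, and the standard ColBERT MaxSim sum for the multi-vector case. The function names and toy vectors are illustrative and do not come from the NetraEmbed or ColNetraEmbed code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dense_score(query_vec, doc_vec):
    """Single dense vector scoring: one embedding per query and per document."""
    return dot(l2_normalize(query_vec), l2_normalize(doc_vec))

def maxsim_score(query_token_vecs, doc_token_vecs):
    """ColBERT-style late interaction: for every query token, take its best
    match among the document's vectors, then sum those maxima."""
    q = [l2_normalize(t) for t in query_token_vecs]
    d = [l2_normalize(p) for p in doc_token_vecs]
    return sum(max(dot(t, p) for p in d) for t in q)

# Toy example (hypothetical): 2 query token vectors, 3 document vectors.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_vectors = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

single = dense_score([1.0, 0.0], [2.0, 0.0])
late = maxsim_score(query_tokens, doc_vectors)
```

The single-vector form stores one embedding per page and supports ordinary nearest-neighbour search. The multi-vector form stores many vectors per page and costs more at query time, in exchange for token-level matching.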

Paper Structure

This paper contains 54 sections, 9 equations, 21 figures, and 17 tables.

Figures (21)

  • Figure 1: Overview of NetraEmbed, our multilingual multimodal document embedding model. (A) Offline indexing encodes documents into dense vectors in a shared semantic space, (B) online retrieval processes cross-lingual queries, and (C) results show effective matching across diverse scripts and languages.
  • Figure 2: M3DR Framework Overview. Our complete pipeline encompasses synthetic data generation (layout detection, neural translation to 22 languages, visual rendering with authentic typography), query synthesis using large VLMs, dense embedding model training with Matryoshka representation learning, and multilingual document retrieval across diverse script families.
  • Figure 3: Training Strategy Comparison. Positive-only (in-batch negatives) training strategy substantially outperforms document-level negative and hard negative mining (combined text+visual) strategies, with consistent improvements throughout training.
  • Figure 4: Per-Language Performance Across 22 Languages. NetraEmbed achieves consistently high performance across all languages and script families such as Latin, Devanagari, CJK, Arabic, and others, while English-centric baselines show significant drops on non-English content.
  • Figure 5: Base Model Comparison: ViDoRe vs Cross-lingual NDCG@5 for all baseline models. Models achieving high ViDoRe performance (English-dominated) often fail catastrophically on cross-lingual tasks.
  • ...and 16 more figures
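
The positive-only strategy in Figure 3 relies on in-batch negatives: each query is paired with its own document, and the other documents in the batch serve as negatives. A minimal sketch of that idea follows, using the widely used InfoNCE form of the contrastive loss. The excerpt above does not give the paper's exact loss or temperature, so the temperature value and the function name are assumptions.

```python
import math

def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    doc_vecs[i] is the positive for query_vecs[i]; every other document
    in the batch is treated as a negative. `temperature` is a hypothetical
    default, not a value reported in the source.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in doc_vecs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the matching document.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy batch (hypothetical): unit vectors, query i matches document i.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = in_batch_contrastive_loss(queries, aligned_docs)
swapped_loss = in_batch_contrastive_loss(queries, swapped_docs)
```

The loss is small when each query is closest to its own document and large when it is closer to another document in the batch. No explicitly mined negatives are needed, which is what distinguishes this setup from the document-level and hard negative strategies in the caption.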