DocMMIR: A Framework for Document Multi-modal Information Retrieval

Zirui Li, Siwei Wu, Yizhi Li, Xingyu Wang, Yi Zhou, Chenghua Lin

TL;DR

DocMMIR addresses document-level multimodal information retrieval across heterogeneous domains with a unified dual-encoder framework that uses late fusion and a symmetric BCE objective. It introduces a large cross-domain benchmark spanning Wikipedia, arXiv, and Slides, built with domain-specific preprocessing and quality filtering to produce coherent multi-modal documents. The study shows that zero-shot MLLMs perform poorly on this task, while task-specific fine-tuning of CLIP yields substantial gains, particularly with simple fusion strategies and BCE loss. The findings highlight the importance of cross-domain training, simple fusion, and modality-aware preprocessing for robust document-level MMIR, with practical implications for scalable cross-domain retrieval systems and future work on layout-aware fusion and query realism.
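
The late-fusion and BCE ideas named above can be illustrated with a minimal sketch. It is not the authors' implementation: the fusion rule (mean-pooled image vectors combined with the text vector by a weighted sum), the `text_weight` and `scale` parameters, and every function name are assumptions made for illustration. The loss is a pairwise binary cross-entropy over the query-document similarity matrix, which is symmetric in the sense that each pair contributes the same term in either retrieval direction; the exact formulation used in the paper is not given in this summary.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def late_fuse(text_emb, image_embs, text_weight=0.5):
    """Hypothetical late fusion: each modality is encoded separately,
    image vectors are mean-pooled, then combined with the text vector."""
    if not image_embs:
        # Assumption: fall back to text only when a document has no images.
        return l2_normalize(text_emb)
    dim = len(text_emb)
    pooled = [sum(e[i] for e in image_embs) / len(image_embs) for i in range(dim)]
    fused = [text_weight * t + (1.0 - text_weight) * v
             for t, v in zip(text_emb, pooled)]
    return l2_normalize(fused)

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def pairwise_bce(query_embs, doc_embs, scale=10.0):
    """Mean BCE over all query-document pairs in a batch.
    Matching pairs (same index) are positives, all others negatives."""
    n = len(query_embs)
    total = 0.0
    for i, q in enumerate(query_embs):
        for j, d in enumerate(doc_embs):
            logit = scale * sum(a * b for a, b in zip(q, d))
            total += bce_with_logits(logit, 1.0 if i == j else 0.0)
    return total / (n * n)

# Toy batch: two queries and two documents in a 2-d embedding space.
queries = [l2_normalize([1.0, 0.0]), l2_normalize([0.0, 1.0])]
docs = [
    late_fuse([1.0, 0.0], [[1.0, 0.0], [0.8, 0.2]]),  # text plus two images
    late_fuse([0.0, 1.0], []),                         # text-only document
]
aligned_loss = pairwise_bce(queries, docs)
shuffled_loss = pairwise_bce(queries, docs[::-1])
```

In this toy batch the aligned pairing gives a lower loss than the shuffled one, which is the behavior a retrieval objective of this kind should reward.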

Abstract

The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multi-modal document retrieval framework designed explicitly to unify diverse document formats and domains, including Wikipedia articles, scientific papers (arXiv), and presentation slides, within a comprehensive retrieval scenario. We construct a large-scale cross-domain multimodal benchmark, comprising 450K samples, which systematically integrates textual and visual information. Our comprehensive experimental analysis reveals substantial limitations in current state-of-the-art MLLMs (CLIP, BLIP2, SigLIP-2, ALIGN) when applied to our tasks, with only CLIP demonstrating reasonable zero-shot performance. Furthermore, we conduct a systematic investigation of training strategies, including cross-modal fusion methods and loss functions, and develop a tailored approach to train CLIP on our benchmark. This results in a +31% improvement in MRR@10 compared to the zero-shot baseline. All our data and code are released at https://github.com/J1mL1/DocMMIR.
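
For readers unfamiliar with the metric, MRR@10 averages over queries the reciprocal rank of the first relevant document, counting a query as zero when no relevant document appears in the top 10. The sketch below is a generic implementation that assumes one gold document per query; the function name and the toy document IDs are hypothetical and do not come from the DocMMIR codebase.

```python
def mrr_at_k(rankings, gold_ids, k=10):
    """Mean reciprocal rank truncated at k.

    rankings: one ranked list of document IDs per query (best first).
    gold_ids: the single relevant document ID for each query (assumption).
    """
    total = 0.0
    for ranking, gold in zip(rankings, gold_ids):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(rankings)

# Three toy queries: gold at rank 2, gold at rank 1, gold not retrieved.
rankings = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
gold_ids = ["d1", "d2", "d0"]
score = mrr_at_k(rankings, gold_ids)  # (1/2 + 1 + 0) / 3 = 0.5
```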

Paper Structure

This paper contains 50 sections, 7 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Overview of DocMMIR. The multi-modal documents across domains are formalized into a unified framework.
  • Figure 2: t-SNE visualization of semantic embedding shifts before and after fine-tuning. Shapes indicate training stage (circle = zero-shot, square = fine-tuned), ellipses denote cluster variance, and arrows indicate the shift of mean embeddings.
  • Figure 3: Example entry from the DocMMIR dataset, showing a document excerpt, texts, linked images, and associated query. Source document: Diana (mythology) (Wikimedia Commons).