ModernVBERT: Towards Smaller Visual Document Retrievers

Paul Teiletche, Quentin Macé, Max Conti, Antonio Loison, Gautier Viaud, Pierre Colombo, Manuel Faysse

TL;DR

This work critically reevaluates how visual information from documents should be integrated for retrieval. Through controlled experiments on modality alignment, attention masking, image resolution, and contrastive training, it shows that token-level, bidirectional interactions yield the best document-retrieval performance, especially with late-interaction matching. Building on these insights, the authors introduce ModernVBERT, a compact 250M multimodal encoder that, when fine-tuned for document retrieval (ColModernVBERT), matches or approaches the performance of models an order of magnitude larger while offering CPU-friendly inference and lower latency. The release of models and code aims to enable cost-effective, scalable visual document retrieval in industrial settings and to spur further research into efficient multimodal embeddings.

Abstract

Retrieving specific information from a large corpus of documents is a prevalent industrial use case of modern AI, notably due to the popularity of Retrieval-Augmented Generation (RAG) systems. Although neural document retrieval models have historically operated exclusively in the text space, Visual Document Retrieval (VDR) models - large vision-language decoders repurposed as embedding models which directly work with page screenshots as inputs - are increasingly popular due to the performance and indexing latency gains they offer. In this work, we show that, while cost-efficient, this approach of repurposing generative models bottlenecks retrieval performance. Through controlled experiments, we revisit the entire training pipeline, and establish a principled recipe for improving visual document retrieval models. We notably measure the impact of attention masking, image resolution, modality alignment data regimes, and late interaction centered contrastive objectives which emerge as central performance factors. Building on these insights, we release ModernVBERT, a compact 250M-parameter vision-language encoder that outperforms recent models up to 10 times larger when fine-tuned on document retrieval tasks, enabling efficient inference on cheap CPU hardware and greatly reducing latency and costs while maintaining strong performance. Models, code and data are available at https://huggingface.co/ModernVBERT.
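
The abstract singles out late-interaction matching as a central performance factor but does not spell out the scoring rule. As an illustration only, the following is a minimal sketch of the MaxSim scoring commonly used by late-interaction retrievers in the ColBERT family: each query token embedding is matched to its most similar document token embedding, and the per-token maxima are summed. The function names (`maxsim_score`, `rank_pages`), the cosine normalization, and the toy 2-dimensional vectors are assumptions made for this sketch; they are not taken from the ModernVBERT release.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Late-interaction score: sum over query tokens of the best-matching
    document token similarity (cosine similarity is assumed here)."""
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


def rank_pages(query_tokens: Sequence[Vector], pages: Sequence[Sequence[Vector]]) -> List[int]:
    """Return page indices sorted from highest to lowest MaxSim score."""
    scores = [maxsim_score(query_tokens, page) for page in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy data: two query tokens and two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
page_b = [[1.0, 0.0], [1.0, 0.0]]              # covers only the first query token
ranking = rank_pages(query, [page_a, page_b])
```

Because every query token is scored against every document token, this kind of matching rewards token-level representations, which is consistent with the abstract's emphasis on late-interaction objectives. In a real system the token embeddings would come from the encoder and the scoring would be batched on tensors rather than computed in pure Python.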

Paper Structure

This paper contains 33 sections, 5 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: Pareto efficiency. ColModernVBERT outperforms models in its category on ViDoRe, achieving a leading performance-size tradeoff.
  • Figure 2: MLM-based early fusion architecture. The visual encoder produces patch representations, which are passed to a language model. Our end-to-end bidirectional attention fused architecture is trained with Masked Language Modeling objectives and is perfectly suited for sequence and token-level representation tasks.
  • Figure 3: Impact of Modality Alignment objective on downstream tasks. Early Fusion of vision and text models boosts document retrieval tasks regardless of the LM objective, but degrades natural image and classification tasks w.r.t. the standalone fine-tuned vision model SigLIP. Reported scores are aggregated MIEB scores (nDCG, Accuracy).
  • Figure 4: Modality alignment scaling of early fusion encoders for up to 1 epoch (3.5B tokens) of data. The dashed line indicates the vision encoder evaluated standalone without further training. Our findings show that retrieval tasks benefit from an extended modality alignment phase, particularly document retrieval, where performance quickly surpasses that of the standalone vision encoder.
  • Figure 5: Impact of attention masks and training objectives on document retrieval performance. We report the average nDCG@5 on English splits of ViDoRe benchmarks for models post-trained on ColPali.
  • ...and 7 more figures
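
Several captions above report nDCG, and Figure 5 reports nDCG@5 specifically. For readers unfamiliar with the metric, the following is a minimal sketch of one common formulation (linear gain with a log2 rank discount). The function names and the toy relevance lists are assumptions made for illustration, and the benchmark's own evaluation code may differ in details such as the gain function.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results.
    The result at 0-based position `rank` is discounted by log2(rank + 2)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 5) -> float:
    """nDCG@k for one query.

    `ranked_relevances` lists the relevance label of every judged document,
    in the order the retriever ranked them. The ideal ordering is obtained
    by sorting those same labels from most to least relevant."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant document exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical rankings for a query with a single relevant page.
perfect = ndcg_at_k([1, 0, 0, 0, 0, 0])        # relevant page ranked first
second = ndcg_at_k([0, 1, 0, 0, 0, 0])         # relevant page ranked second
missed = ndcg_at_k([0, 0, 0, 0, 0, 1])         # relevant page outside the top 5
```

A benchmark-level score such as the one in Figure 5 would then be an average of per-query values over the evaluated splits.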