ModernVBERT: Towards Smaller Visual Document Retrievers

Paul Teiletche; Quentin Macé; Max Conti; Antonio Loison; Gautier Viaud; Pierre Colombo; Manuel Faysse

TL;DR

This work critically reevaluates how visual information from documents should be integrated for retrieval. Through controlled experiments on modality alignment, attention masking, image resolution, and contrastive training, it shows that token-level, bidirectional interactions yield the best document-retrieval performance, especially with late-interaction matching. Building on these insights, the authors introduce ModernVBERT, a compact 250M-parameter multimodal encoder that, when fine-tuned for document retrieval (ColModernVBERT), matches or approaches the performance of models an order of magnitude larger while offering CPU-friendly inference and lower latency. The released models and code aim to enable cost-effective, scalable visual document retrieval in industrial settings and to spur further research into efficient multimodal embeddings.

Abstract

Retrieving specific information from a large corpus of documents is a prevalent industrial use case of modern AI, notably due to the popularity of Retrieval-Augmented Generation (RAG) systems. Although neural document retrieval models have historically operated exclusively in the text space, Visual Document Retrieval (VDR) models (large vision-language decoders repurposed as embedding models that work directly with page screenshots as inputs) are increasingly popular due to the performance and indexing latency gains they offer. In this work, we show that, while cost-efficient, this approach of repurposing generative models bottlenecks retrieval performance. Through controlled experiments, we revisit the entire training pipeline and establish a principled recipe for improving visual document retrieval models. We notably measure the impact of attention masking, image resolution, modality alignment data regimes, and late-interaction-centered contrastive objectives, which emerge as central performance factors. Building on these insights, we release ModernVBERT, a compact 250M-parameter vision-language encoder that outperforms recent models up to 10 times larger when fine-tuned on document retrieval tasks, enabling efficient inference on cheap CPU hardware and greatly reducing latency and costs while maintaining strong performance. Models, code, and data are available at https://huggingface.co/ModernVBERT.
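
To make the notion of late-interaction matching concrete, the following is a minimal sketch of a ColBERT-style MaxSim score, assuming each query and each page is represented by one embedding vector per token or image patch. The function names (maxsim_score, rank_pages), the use of cosine similarity, and the NumPy implementation are illustrative assumptions and are not taken from the released ModernVBERT code.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one page.

    query_emb: (n_query_tokens, dim) array of query token embeddings.
    page_emb:  (n_page_tokens, dim) array of page token/patch embeddings.
    Both layouts are assumptions made for this sketch.
    """
    # L2-normalise so that dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = page_emb / np.linalg.norm(page_emb, axis=1, keepdims=True)
    # Token-level similarity matrix of shape (n_query_tokens, n_page_tokens).
    sim = q @ p.T
    # For each query token keep its best-matching page token, then sum.
    return float(sim.max(axis=1).sum())

def rank_pages(query_emb: np.ndarray, page_embs: list) -> list:
    """Return page indices sorted from most to least relevant."""
    scores = [maxsim_score(query_emb, p) for p in page_embs]
    return sorted(range(len(page_embs)), key=lambda i: scores[i], reverse=True)

if __name__ == "__main__":
    query = np.array([[1.0, 0.0], [0.0, 1.0]])
    page_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # covers both query tokens
    page_b = np.array([[1.0, 0.0]])                          # covers only the first
    print(rank_pages(query, [page_b, page_a]))
```

Unlike single-vector retrieval, which compresses a page into one embedding before comparison, this scoring keeps one vector per token, so a page is rewarded for each query token it matches well.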
