Unifying Multimodal Retrieval via Document Screenshot Embedding
Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin
TL;DR
Document Screenshot Embedding (DSE) introduces a unified, OCR-free retrieval paradigm that encodes entire document screenshots with a large vision-language model in a bi-encoder setup. By preserving layout and multimodal content, DSE is competitive with parsing-based text retrievers on the text-intensive Wiki-SS task, where it beats BM25 by 17 points in top-1 accuracy, and it outperforms OCR-based text retrieval by over 15 points in nDCG@10 on the mixed-modality SlideVQA Open task. The approach relies on patch-based visual encoding and contrastive learning, and the paper examines patch granularity, zero-shot generalization, and qualitative cases showing how visual context complements textual information. The work points toward a scalable, end-to-end retrieval framework that can feed directly into visual retrieval-augmented generation (V-RAG) systems, reducing reliance on content extraction and enabling more faithful document understanding in real-world, multimodal settings.
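To make the bi-encoder and contrastive-learning idea concrete, here is a minimal sketch in plain Python. It assumes an InfoNCE-style loss with in-batch negatives and cosine similarity; the summary above only says "contrastive learning", so the loss form, the temperature value, and the function names `cosine`, `info_nce_loss`, and `rank_documents` are illustrative assumptions, not the authors' implementation. The embeddings stand in for the vectors a vision-language encoder would produce for queries and screenshots.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query_embs, doc_embs, temperature=0.05):
    """Mean InfoNCE loss over a batch (assumed loss form).

    doc_embs[i] is treated as the positive for query_embs[i];
    every other document in the batch acts as a negative.
    The temperature is a hypothetical default, not a value from the paper.
    """
    losses = []
    for i, q in enumerate(query_embs):
        logits = [cosine(q, d) / temperature for d in doc_embs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


def rank_documents(query_emb, doc_embs):
    """Return document indices sorted by descending similarity to the query."""
    scores = [cosine(query_emb, d) for d in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda j: scores[j], reverse=True)
```

At training time the loss pulls each query toward its paired screenshot embedding and away from the others in the batch; at search time only `rank_documents` is needed, since screenshots can be embedded once and indexed.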
Abstract
In the real world, documents are organized in different formats and varied modalities. Traditional retrieval pipelines require tailored document parsing techniques and content extraction modules to prepare input for indexing. This process is tedious, error-prone, and loses information. To this end, we propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that treats document screenshots as a unified input format, requires no content extraction preprocessing, and preserves all the information in a document (e.g., text, images, and layout). DSE leverages a large vision-language model to directly encode document screenshots into dense representations for retrieval. To evaluate our method, we first construct Wiki-SS, a corpus of 1.3M Wikipedia web page screenshots, to answer questions from the Natural Questions dataset. In this text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods that rely on parsing. For example, DSE outperforms BM25 by 17 points in top-1 retrieval accuracy. Additionally, in a mixed-modality slide retrieval task, DSE significantly outperforms OCR text retrieval methods by over 15 points in nDCG@10. These experiments show that DSE is an effective document retrieval paradigm for diverse types of documents. Model checkpoints, code, and the Wiki-SS collection will be released.
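The two metrics quoted above, top-1 retrieval accuracy and nDCG@10, can be sketched as follows. This is a minimal illustration assuming binary relevance and sets of relevant document IDs per query; the paper's actual evaluation protocol (for example, how a retrieved Wikipedia page is judged to contain an answer) may differ, and the function names are hypothetical.

```python
import math


def top_k_accuracy(ranked_lists, relevant_sets, k=1):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(doc_id in relevant for doc_id in ranked[:k])
    )
    return hits / len(ranked_lists)


def ndcg_at_k(ranked, relevant, k=10):
    """nDCG@k for one query under binary relevance (assumed here)."""
    # Gain of 1 for each relevant document, discounted by log2(position + 1),
    # where position is 1-based (rank is 0-based, hence rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked[:k])
        if doc_id in relevant
    )
    # The ideal ranking places all relevant documents first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

A reported gain of "15 points in nDCG@10" would then correspond to a difference of 0.15 in the mean of `ndcg_at_k` over all queries, assuming scores are reported on a 0 to 100 scale.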
