Unifying Multimodal Retrieval via Document Screenshot Embedding

Xueguang Ma; Sheng-Chieh Lin; Minghan Li; Wenhu Chen; Jimmy Lin

TL;DR

Document Screenshot Embedding (DSE) introduces a unified, OCR-free retrieval paradigm that encodes entire document screenshots with a large vision–language model in a bi-encoder setup. By preserving layout and multimodal content, DSE achieves strong retrieval performance on text-intensive Wiki-SS and mixed-modality SlideVQA Open tasks, outperforming traditional text-based baselines and OCR-only approaches. The approach leverages patch-based visual encoding and contrastive learning, with detailed exploration of patch granularity, zero-shot generalization, and qualitative analysis illustrating how visual context complements textual information. This work suggests a scalable, end-to-end retrieval framework that can feed directly into retrieval-augmented generation (V-RAG) systems, reducing reliance on content extraction and enabling more faithful document understanding in real-world, multimodal settings.
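
To make the bi-encoder and contrastive-learning idea concrete, here is a minimal sketch in plain Python. It assumes the query and document embeddings are already available as lists of floats, compares them with cosine similarity, and trains with an InfoNCE-style loss over one positive and several negatives. The function names, the cosine scoring, and the `temperature` default are illustrative assumptions, not details taken from this page.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

def info_nce_loss(query_vec, positive_vec, negative_vecs, temperature=0.05):
    """InfoNCE-style loss: negative log-softmax of the positive's score.

    `temperature` is a hypothetical default chosen for illustration only.
    """
    logits = [cosine(query_vec, positive_vec) / temperature]
    logits += [cosine(query_vec, n) / temperature for n in negative_vecs]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

if __name__ == "__main__":
    query = [1.0, 0.0]
    docs = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
    print(rank_documents(query, docs))
    print(info_nce_loss(query, docs[1], [docs[0], docs[2]]))
```

In a DSE-style setup the document vectors would come from the screenshot encoder and the query vector from the text encoder; the loss is small when the positive screenshot scores well above the negatives.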

Abstract

In the real world, documents are organized in different formats and varied modalities. Traditional retrieval pipelines require tailored document parsing techniques and content extraction modules to prepare input for indexing. This process is tedious, error-prone, and loses information. To this end, we propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that regards document screenshots as a unified input format, which requires no content extraction preprocessing and preserves all the information in a document (e.g., text, image, and layout). DSE leverages a large vision-language model to directly encode document screenshots into dense representations for retrieval. To evaluate our method, we first craft Wiki-SS, a dataset of 1.3M Wikipedia web page screenshots that serves as the corpus for answering questions from the Natural Questions dataset. In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods that rely on parsing. For example, DSE outperforms BM25 by 17 points in top-1 retrieval accuracy. Additionally, in a mixed-modality task of slide retrieval, DSE significantly outperforms OCR text retrieval methods by over 15 points in nDCG@10. These experiments show that DSE is an effective document retrieval paradigm for diverse types of documents. Model checkpoints, code, and the Wiki-SS collection will be released.
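
The two metrics quoted above, top-1 retrieval accuracy and nDCG@10, can be sketched as follows. This is a minimal illustration that assumes each query comes with a known set of relevant document ids (for top-k accuracy) or graded relevance labels (for nDCG), and it uses the common linear-gain form of DCG. The function names and input shapes are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math

def top_k_accuracy(ranked_lists, relevant_sets, k=1):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(doc_id in relevant for doc_id in ranked[:k])
    )
    return hits / len(ranked_lists)

def dcg(gains):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query; `relevance` maps document id -> graded label."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

if __name__ == "__main__":
    print(top_k_accuracy([["a", "b"], ["c", "d"]], [{"a"}, {"d"}], k=1))
    print(ndcg_at_k(["x", "a"], {"a": 1}, k=10))
```

A "17 point" gain in top-1 accuracy then means the fraction returned by `top_k_accuracy(..., k=1)` rises by 0.17 when averaged over the query set.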

Paper Structure (38 sections, 3 equations, 7 figures, 5 tables)

Introduction
Related Work
Neural Document Retrieval
Large Vision-Language Model
Document Retrieval Datasets
Method
Task Definition
Document Screenshot Embedding
Visual Encoder
Vision Language Model
Contrastive Learning
Experiment Setup
Web-Page Retrieval
Dataset
Training Data
...and 23 more sections

Figures (7)

Figure 1: Comparison between (a) the existing document retrieval paradigm and (b) our proposed paradigm. DSE bypasses the document parsing and content extraction process, directly encoding the original appearance of documents with multimodal contents into a dense representation for indexing.
Figure 2: Overview of the DSE encoder architecture. DSE adopts a bi-encoder architecture, where the document tower encodes the document screenshot into a dense vector by taking vision input and the query tower encodes the query by taking text input. Document and query encoders share the same language model.
Figure 3: A snapshot of a Wikipedia webpage divided by different numbers of patches (red small squares). As the number of patches increases, each patch can capture more fine-grained text information in the screenshot. $(C_x, C_y)$ means the image is divided into $C_x \times C_y$ sub-images, which are then converted into $(C_x\times24)\times(C_y\times24)$ patches. See more details in Section \ref{sec:method} and Figure \ref{fig:enter-label}.
Figure 4: Trade-off between effectiveness and efficiency of DSE with varying numbers of crops for input images. The inference speed is measured on a single H100 GPU with BF16 precision and FlashAttention enabled.
Figure 5: Case study on two examples in Wikipedia and SlideQA. We visualize the multi-head attention from the fine-tuned embedding to the image patches at the last layer. GLOBAL-HEAD is the attention head to the coarse image features (336$\times$336), while the LOCAL-HEAD is the attention head to more fine-grained image features after cropping (16$\times$336$\times$336).
...and 2 more figures
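
The patch arithmetic in the Figure 3 caption can be checked with a short sketch. It assumes, as the caption states, that each of the $C_x \times C_y$ sub-images contributes a 24×24 grid of patches. The function names are hypothetical, and the count covers only the sub-image patches described in that caption.

```python
def patch_grid(c_x, c_y, patches_per_side=24):
    """Patch grid (width, height) for an image split into c_x by c_y sub-images."""
    return (c_x * patches_per_side, c_y * patches_per_side)

def total_patches(c_x, c_y, patches_per_side=24):
    """Total number of patches across all sub-images."""
    width, height = patch_grid(c_x, c_y, patches_per_side)
    return width * height

if __name__ == "__main__":
    for crops in [(1, 1), (2, 2), (4, 4)]:
        print(crops, patch_grid(*crops), total_patches(*crops))
```

The patch count grows quadratically with the number of crops per side, which is the source of the effectiveness versus efficiency trade-off that Figure 4 reports.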
