Unifying Multimodal Retrieval via Document Screenshot Embedding

Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin

TL;DR

Document Screenshot Embedding (DSE) introduces a unified, OCR-free retrieval paradigm that encodes entire document screenshots with a large vision–language model in a bi-encoder setup. By preserving layout and multimodal content, DSE achieves strong retrieval performance on text-intensive Wiki-SS and mixed-modality SlideVQA Open tasks, outperforming traditional text-based baselines and OCR-only approaches. The approach leverages patch-based visual encoding and contrastive learning, with detailed exploration of patch granularity, zero-shot generalization, and qualitative analysis illustrating how visual context complements textual information. This work suggests a scalable, end-to-end retrieval framework that can feed directly into retrieval-augmented generation (V-RAG) systems, reducing reliance on content extraction and enabling more faithful document understanding in real-world, multimodal settings.

Abstract

In the real world, documents are organized in different formats and varied modalities. Traditional retrieval pipelines require tailored document parsing techniques and content extraction modules to prepare input for indexing. This process is tedious, error-prone, and loses information. To this end, we propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that regards document screenshots as a unified input format, which requires no content extraction preprocessing and preserves all the information in a document (e.g., text, images, and layout). DSE leverages a large vision-language model to directly encode document screenshots into dense representations for retrieval. To evaluate our method, we first construct Wiki-SS, a corpus of 1.3M Wikipedia web page screenshots, to answer questions from the Natural Questions dataset. In this text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods that rely on parsing. For example, DSE outperforms BM25 by 17 points in top-1 retrieval accuracy. Additionally, in a mixed-modality slide retrieval task, DSE significantly outperforms OCR text retrieval methods by over 15 points in nDCG@10. These experiments show that DSE is an effective document retrieval paradigm for diverse types of documents. Model checkpoints, code, and the Wiki-SS collection will be released.
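
As a rough illustration of how dense representations are used at retrieval time, the sketch below ranks a few toy document vectors against a query vector. It is not the authors' implementation: the embeddings are hand-written stand-ins for encoder outputs, cosine similarity is an assumed scoring function, and the names `cosine` and `rank_documents` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs, k=1):
    """Return the k (doc_id, score) pairs with the highest similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for screenshot embeddings (assumed values).
doc_vecs = {
    "page_a": [1.0, 0.0, 0.0],
    "page_b": [0.0, 1.0, 0.0],
    "page_c": [0.6, 0.8, 0.0],
}
# Toy vector standing in for a text-query embedding (assumed values).
query_vec = [0.0, 0.9, 0.1]

top = rank_documents(query_vec, doc_vecs, k=2)
for doc_id, score in top:
    print(f"{doc_id}: {score:.3f}")
```

In a real system the document vectors would be computed once, offline, and stored in an index, so that only the query needs to be encoded at search time.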

Paper Structure (38 sections, 3 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Comparison between (a) the existing document retrieval paradigm and (b) our proposed paradigm. DSE bypasses the document parsing and content extraction process, directly encoding the original appearance of documents with multimodal contents into a dense representation for indexing.
  • Figure 2: Overview of the DSE encoder architecture. DSE adopts a bi-encoder architecture, where the document tower encodes the document screenshot into a dense vector from vision input and the query tower encodes the query from text input. Document and query encoders share the same language model.
  • Figure 3: A snapshot of a Wikipedia webpage divided into different numbers of patches (small red squares). As the number of patches increases, each patch can capture more fine-grained text information in the screenshot. $(C_x, C_y)$ means the image is divided into $C_x \times C_y$ sub-images, which are then converted into $(C_x\times24)\times(C_y\times24)$ patches. See more detail in Section \ref{sec:method} and Figure \ref{fig:enter-label}.
  • Figure 4: Trade-off between effectiveness and efficiency of DSE with varying numbers of crops for input images. The inference speed is measured on a single H100 GPU with BF16 precision and FlashAttention enabled.
  • Figure 5: Case study on two examples in Wikipedia and SlideQA. We visualize the multi-head attention from the fine-tuned embedding to the image patches at the last layer. GLOBAL-HEAD is the attention head to the coarse image features (336$\times$336), while the LOCAL-HEAD is the attention head to more fine-grained image features after cropping (16$\times$336$\times$336).
  • ...and 2 more figures
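
The patch arithmetic in the Figure 3 caption can be checked with a small sketch. It only evaluates the caption's formula $(C_x\times24)\times(C_y\times24)$; the function name `patch_grid` and the choice of example crop settings are assumptions made here for illustration.

```python
def patch_grid(c_x, c_y, patches_per_side=24):
    """Return (columns, rows, total) of patches for a C_x x C_y crop layout.

    Each sub-image contributes patches_per_side x patches_per_side patches,
    following the formula in the Figure 3 caption.
    """
    cols = c_x * patches_per_side
    rows = c_y * patches_per_side
    return cols, rows, cols * rows


# Example crop settings (assumed for illustration).
for crops in [(1, 1), (2, 2), (4, 4)]:
    cols, rows, total = patch_grid(*crops)
    print(f"crops={crops}: {cols} x {rows} grid, {total} patches")
```

Under this formula, a (4, 4) setting, which corresponds to the 16 crops mentioned in the Figure 5 caption, yields a 96 x 96 grid of 9216 patches, 16 times as many as a single uncropped image.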