Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval

Ze Liu, Zhengyang Liang, Junjie Zhou, Zheng Liu, Defu Lian

TL;DR

This work defines Visualized Information Retrieval (Vis-IR), proposing a unified visual representation for multimodal data via Screenshots. It introduces VIRA, a 13M-screenshot dataset with caption and QA annotations; UniSE, a two-branch embedding family (CLIP-based and MLLM-based) trained in two stages; and MVRB, a comprehensive benchmark spanning four task categories and multiple domains. Empirical results show UniSE outperforms existing multimodal and screenshot-specific retrievers, and that dedicated Vis-IR data and training strategies substantially improve cross-modal retrieval and QA performance. The project aims to accelerate Vis-IR development by releasing datasets, models, and benchmarks to support robust, domain-diverse, and open-ended visualized information retrieval research and applications.

Abstract

With the growing popularity of multimodal techniques, there is increasing interest in acquiring useful information in visual forms. In this work, we formally define an emerging IR paradigm called Visualized Information Retrieval, or Vis-IR, where multimodal information, such as texts, images, tables, and charts, is jointly represented by a unified visual format called Screenshots, for various retrieval applications. We further make three key contributions for Vis-IR. First, we create VIRA (Vis-IR Aggregation), a large-scale dataset comprising a vast collection of screenshots from diverse sources, carefully curated into captioned and question-answer formats. Second, we develop UniSE (Universal Screenshot Embeddings), a family of retrieval models that enable screenshots to query or be queried across arbitrary data modalities. Finally, we construct MVRB (Massive Visualized IR Benchmark), a comprehensive benchmark covering a variety of task forms and application scenarios. Through extensive evaluations on MVRB, we highlight the deficiencies of existing multimodal retrievers and the substantial improvements made by UniSE. Our work will be shared with the community, laying a solid foundation for this emerging field.

Paper Structure

This paper contains 51 sections, 4 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: A use case of Vis-IR. Users take a screenshot of the news they are interested in by circling and selecting it, and then search for relevant news reports associated with "Nvidia" using a query conditioned on the screenshot.
  • Figure 2: Creation process of the VIRA dataset, including 1) comprehensive screenshot collection from various sources, 2) fine-grained screenshot captioning, 3) similar-screenshot mining, 4) q2s annotation, and 5) sq2s annotation.
  • Figure 3: MVRB benchmark. There are four task categories: screenshot retrieval, composed screenshot retrieval, screenshot question answering, and open-vocab classification. Each category covers multiple concrete task scenarios.
  • Figure 4: The prompt used for q2s annotation. This prompt is designed for the news domain. For other domains, the word in blue can be substituted with the appropriate term for that domain.
  • Figure 5: The prompt used for sq2s annotation. This prompt is intended for the paper domain. For other domains, the word in blue can be substituted with the appropriate term for that domain.
  • ...and 5 more figures