Towards Text-Image Interleaved Retrieval

Xin Zhang, Ziqi Dai, Yongqi Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Jun Yu, Wenjie Li, Min Zhang

TL;DR

This work introduces the text-image interleaved retrieval (TIIR) task and the wikiHow-TIIR benchmark to address real-world scenarios where queries and documents contain interleaved text and images. It demonstrates that preserving interleaved structure is crucial for effective TIIR and proposes the Matryoshka Multimodal Embedder (MME), a token-compression strategy that generates nested visual token sets to balance information richness with efficiency. In extensive experiments, native interleaved retrievers consistently outperform adapted non-interleaved models, and MME delivers substantial improvements while using fewer visual tokens and encoding inputs more efficiently. The study also analyzes interleaved context modeling, adaptation strategies, and token dynamics, offering practical insights for future TIIR research and benchmark development.

Abstract

Current multimodal information retrieval studies mainly focus on single-image inputs, which limits real-world applications involving multiple images and text-image interleaved content. In this work, we introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences, and the model is required to understand the semantics of the interleaved context for effective retrieval. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. To explore the task, we adapt several off-the-shelf retrievers and build a dense baseline on an interleaved multimodal large language model (MLLM). We then propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities, to address the challenge of excessive visual tokens in MLLM-based TIIR models. Experiments demonstrate that simple adaptation of existing models does not consistently yield effective results. Our MME achieves significant improvements over the baseline with substantially fewer visual tokens. We provide extensive analysis and will release the dataset and code to facilitate future research.
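
To make the idea of visual tokens "at different granularities" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that an image is encoded as a square grid of token embeddings and that coarser token sets are obtained by 2D average pooling; the abstract does not state the pooling operator, so that choice, along with the names `average_pool` and `nested_token_sets` and the toy grid, is purely illustrative.

```python
from typing import Dict, List

Vector = List[float]
Grid = List[List[Vector]]  # grid[row][col] is one visual token embedding


def average_pool(grid: Grid, factor: int) -> Grid:
    """Average non-overlapping factor x factor blocks of token embeddings.

    Assumption: pooling is a stand-in for whatever compression MME uses.
    """
    side = len(grid)
    if side % factor != 0 or any(len(row) != side for row in grid):
        raise ValueError("expected a square grid whose side is divisible by factor")
    dim = len(grid[0][0])
    pooled: Grid = []
    for r in range(0, side, factor):
        pooled_row = []
        for c in range(0, side, factor):
            block = [grid[r + i][c + j] for i in range(factor) for j in range(factor)]
            pooled_row.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
        pooled.append(pooled_row)
    return pooled


def nested_token_sets(grid: Grid, factors=(1, 2, 4)) -> Dict[int, List[Vector]]:
    """Map each resulting token count to its flattened (row-major) token list."""
    sets: Dict[int, List[Vector]] = {}
    for factor in factors:
        pooled = average_pool(grid, factor)
        tokens = [token for row in pooled for token in row]
        sets[len(tokens)] = tokens
    return sets


# Toy 4x4 grid of 2-dimensional "visual tokens" (values are arbitrary).
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
sets = nested_token_sets(grid)
print(sorted(sets))  # token counts available per image: [1, 4, 16]
```

Under these assumptions, each image yields 16, 4, or 1 tokens, so an interleaved sequence with many images can be encoded with a much shorter visual context at the coarser settings.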

Paper Structure

This paper contains 43 sections, 1 equation, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Comparison of our Text-Image Interleaved Retrieval task to previous settings. Blocks with black borders represent text, image, or fused-modal data.
  • Figure 2: Our data construction workflow (§\ref{sec:tiir:build}), where steps (a), (b), and (c) comprise the generation pipeline, and (d) shows the brief annotation guideline. Technical details and principles are provided in Appendix \ref{sec:app:data:query-gen} and \ref{sec:app:data:annotation}.
  • Figure 3: Our TIIR model overview, where (a) is the DPR baseline (§\ref{sec:method:baseline}), (b) illustrates the computation of visual tokens in different granularities, and (c) shows the training strategies of our MME.
  • Figure 4: Results of interleaved models evaluated on settings of original data, shuffled image ordering, shuffled image position, and shuffled image ordering & position.
  • Figure 5: Performance curves for different settings of Matryoshka-style visual tokens, where all three training strategies (§\ref{sec:method:mme}) are presented. The best one (mean) is selected as the final model.
  • ...and 9 more figures