Towards Text-Image Interleaved Retrieval
Xin Zhang, Ziqi Dai, Yongqi Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Jun Yu, Wenjie Li, Min Zhang
TL;DR
This work introduces the text-image interleaved retrieval (TIIR) task and the wikiHow-TIIR benchmark to address real-world scenarios where queries and documents contain interleaved text and images. It demonstrates that preserving interleaved structure is crucial for effective TIIR and proposes the Matryoshka Multimodal Embedder (MME), a token-compression strategy that generates nested visual token sets to balance information richness with efficiency. In extensive experiments, native interleaved retrievers consistently outperform adapted non-interleaved models, and MME delivers substantial improvements with fewer visual tokens and better encoding efficiency. The study provides detailed analyses of interleaved context modeling, adaptation strategies, and token dynamics, offering practical insights for future TIIR research and benchmark development.
Abstract
Current multimodal information retrieval studies mainly focus on single-image inputs, which limits real-world applications involving multiple images and text-image interleaved content. In this work, we introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences, and the model is required to understand the semantics of the interleaved context for effective retrieval. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, for which a dedicated pipeline is designed to generate interleaved queries. To explore the task, we adapt several off-the-shelf retrievers and build a dense baseline on an interleaved multimodal large language model (MLLM). We then propose a novel Matryoshka Multimodal Embedder (MME), which compresses the visual tokens at different granularities, to address the challenge of excessive visual tokens in MLLM-based TIIR models. Experiments demonstrate that simple adaptation of existing models does not consistently yield effective results. Our MME achieves significant improvements over the baseline with substantially fewer visual tokens. We provide extensive analysis and will release the dataset and code to facilitate future research.
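
To make the idea of nested visual token sets concrete, the following is a minimal sketch rather than the MME implementation. The abstract does not state the compression operator, so average pooling over a square grid of visual tokens is assumed here purely for illustration; the names `pool_grid` and `nested_token_sets`, the pooling factors, and the toy 4x4 grid are all hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]
Grid = List[List[Vector]]  # grid[row][col] holds one visual token embedding


def pool_grid(grid: Grid, factor: int) -> Grid:
    """Average-pool non-overlapping factor x factor blocks of token vectors.

    Assumption: average pooling stands in for the (unspecified) compression step.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid size must be divisible by the pooling factor")
    dim = len(grid[0][0])
    pooled: Grid = []
    for r in range(0, rows, factor):
        pooled_row: List[Vector] = []
        for c in range(0, cols, factor):
            # Collect the tokens of one block, then average each dimension.
            block = [grid[r + i][c + j] for i in range(factor) for j in range(factor)]
            pooled_row.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
        pooled.append(pooled_row)
    return pooled


def nested_token_sets(grid: Grid, factors: Sequence[int] = (1, 2, 4)) -> Dict[int, List[Vector]]:
    """Build token sets of decreasing size from one grid, keyed by token count."""
    sets: Dict[int, List[Vector]] = {}
    for factor in factors:
        pooled = pool_grid(grid, factor)
        tokens = [vec for row in pooled for vec in row]  # flatten row by row
        sets[len(tokens)] = tokens
    return sets


# Toy 4x4 grid of 2-dimensional tokens (hypothetical data).
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
sets = nested_token_sets(grid)
print(sorted(sets))  # token counts available at each granularity
```

Under these assumptions, a retriever could encode an image with the smallest set when efficiency matters and with a larger set when more visual detail is needed, which is the trade-off between information richness and efficiency that the TL;DR describes.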
