Unified Multimodal Interleaved Document Representation for Retrieval

Jaewoo Lee, Joonho Ko, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang

TL;DR

This work addresses information retrieval over documents that interleave multiple modalities (text, images, tables) and are longer than a single context window. It introduces IDentIfy, which uses a vision-language model to encode text, images, and tables as one interleaved token sequence, and merges the segment embeddings into a single document embedding to preserve document-level context. Retrieval is coarse-to-fine: a retriever first selects candidate documents from their interleaved representations, then a section-level reranker pinpoints the most relevant passage within the retrieved document. Across four benchmark datasets, IDentIfy consistently outperforms text-only and partial-modality baselines on both document and section retrieval, which supports encoding documents holistically across modalities.

Abstract

Information Retrieval (IR) methods, which aim to identify documents relevant to a query, have been widely applied in various natural language tasks. However, existing approaches typically consider only the textual content within documents, overlooking the fact that documents can contain multiple modalities, including images and tables. They also often segment each long document into multiple discrete passages for embedding, which prevents them from capturing the overall document context and the interactions between paragraphs. To address these two challenges, we propose a method that holistically embeds documents interleaved with multiple modalities by leveraging recent vision-language models, which can process and integrate text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we merge the representations of the segmented passages into a single document representation, and we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document when necessary. Through extensive experiments on diverse IR scenarios with both textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to its consideration of the multimodal information within documents.
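
The coarse-to-fine flow described above can be illustrated with a minimal Python sketch. It assumes section embeddings have already been produced by some encoder (random-free toy vectors stand in here), merges them by averaging as the Figure 2 caption describes, and uses cosine similarity for both stages. The function names, the toy data, and the cosine-based section scoring are illustrative assumptions; in the paper the reranker is a trained model that scores the concatenated query and section.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def document_embedding(section_embeddings):
    """Average section embeddings of shape (n_sections, dim) into one document vector."""
    return np.mean(section_embeddings, axis=0)

def retrieve_documents(query_emb, docs, top_k=2):
    """Coarse stage: rank documents by cosine similarity to the query."""
    doc_ids = list(docs)
    doc_matrix = l2_normalize(
        np.stack([document_embedding(docs[d]) for d in doc_ids])
    )
    scores = doc_matrix @ l2_normalize(query_emb)
    order = np.argsort(-scores)[:top_k]
    return [(doc_ids[i], float(scores[i])) for i in order]

def rerank_sections(query_emb, section_embeddings):
    """Fine stage (assumed stand-in): score each section by cosine similarity.

    The paper's reranker is a trained model over the concatenated query and
    section; cosine similarity is used here only to keep the sketch runnable.
    """
    scores = l2_normalize(section_embeddings) @ l2_normalize(query_emb)
    return int(np.argmax(scores)), scores

# Hypothetical toy corpus: two documents with 2 and 3 sections, dim = 3.
docs = {
    "doc_a": np.array([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]),
    "doc_b": np.array([[0.0, 1.0, 0.0], [0.0, 0.9, 0.1], [0.1, 0.8, 0.1]]),
}
query = np.array([0.0, 1.0, 0.0])

ranked = retrieve_documents(query, docs, top_k=1)
best_doc = ranked[0][0]
best_section, _ = rerank_sections(query, docs[best_doc])
print(best_doc, best_section)  # prints: doc_b 0
```

Averaging keeps one vector per document regardless of document length, which is what lets the first stage search at document granularity before the second stage looks inside the retrieved document.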

Paper Structure (36 sections, 2 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Comparison of different IR approaches. (a): Conventional methods use a small portion of the text within the document for its representation. (b): Recent methods use first-page screenshot images to represent the document. (c): Our approach leverages the full contextual information within documents interleaved with multiple modalities by considering them in their original format, and is further capable of pinpointing relevant sections for the query.
  • Figure 2: Overview of the proposed IDentIfy. (a): In our document retriever, a query encoder represents a query (purple), and sections are encoded with a section encoder whose embeddings are averaged to form a document representation (blue). Contrastive learning loss (red) is used for training the document retriever. (b): Reranker scores query-section relevance with the concatenation of the query and section, trained using Binary Cross-Entropy loss.
  • Figure 3: Trade-off between performance (MRR@10) and training cost (GPU Memory) for retrieval.
  • Figure 4: Retrieval performance with different dataset sizes for training. (a): When training the retriever, larger datasets degrade retrieval performance, possibly because the retriever overfits and generalizes poorly. (b): In contrast, a larger dataset benefits reranker training.
  • Figure 5: Retrieved documents across different document formats for document retrieval with a given textual query. (a): A document retrieved when documents are represented with their interleaved multimodal contents (ours). (b): A document retrieved when using only the textual format.
  • ...and 1 more figure
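
The two training objectives named in the Figure 2 caption (a contrastive loss for the retriever, Binary Cross-Entropy for the reranker) can be sketched as follows. This is a generic illustration rather than the paper's exact formulation: the use of in-batch negatives, cosine similarity, the temperature value, and the function names are all assumptions.

```python
import numpy as np

def info_nce_loss(query_embs, doc_embs, temperature=0.05):
    """In-batch contrastive loss (assumed form).

    Row i of doc_embs is treated as the positive document for row i of
    query_embs; every other row in the batch serves as a negative.
    """
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def bce_loss(logits, labels):
    """Binary cross-entropy on raw relevance logits (numerically stable form)."""
    z = np.asarray(logits, dtype=float)
    y = np.asarray(labels, dtype=float)
    per_example = np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return float(per_example.mean())

# Toy check: matched query/document pairs should give a lower contrastive loss
# than deliberately mismatched pairs.
queries = np.eye(3)
aligned = info_nce_loss(queries, np.eye(3))
mismatched = info_nce_loss(queries, np.roll(np.eye(3), 1, axis=0))
print(aligned < mismatched)  # prints: True

# A relevant section with a high logit and an irrelevant one with a low logit.
print(bce_loss([4.0, -4.0], [1.0, 0.0]) < bce_loss([-4.0, 4.0], [1.0, 0.0]))  # prints: True
```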