LILaC: Late Interacting in Layered Component Graph for Open-domain Multimodal Multihop Retrieval

Joohyung Yun, Doyup Lee, Wook-Shin Han

TL;DR

LILaC advances open-domain multimodal retrieval by organizing documents into a layered component graph with one coarse-grained and one fine-grained layer, and by retrieving subgraphs through late interaction guided by LLM-driven query decomposition. The method supports efficient multihop reasoning within and across documents, scoring edges on the fly with fine-grained subcomponents to reduce noise from irrelevant content. Across five benchmarks, LILaC achieves state-of-the-art retrieval and end-to-end QA performance without additional fine-tuning, which demonstrates the effectiveness of combining dual-granularity graphs with late interaction and modality-aware query decomposition. The results suggest that pretrained multimodal encoders and LLMs can improve open-domain multimodal retrieval at scale without task-specific tuning.

Abstract

Multimodal document retrieval aims to retrieve query-relevant components from documents composed of textual, tabular, and visual elements. An effective multimodal retriever needs to handle two main challenges: (1) mitigate the effect of irrelevant content caused by fixed, single-granularity retrieval units, and (2) support multihop reasoning by effectively capturing semantic relationships among components within and across documents. To address these challenges, we propose LILaC, a multimodal retrieval framework featuring two core innovations. First, we introduce a layered component graph that explicitly represents multimodal information at two layers, one coarse-grained and one fine-grained, facilitating efficient yet precise reasoning. Second, we develop a late-interaction-based subgraph retrieval method, an edge-based approach that first identifies coarse-grained nodes for efficient candidate generation, then performs fine-grained reasoning via late interaction. Extensive experiments demonstrate that LILaC achieves state-of-the-art retrieval performance on all five benchmarks, notably without additional fine-tuning. We make the artifacts publicly available at github.com/joohyung00/lilac.
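
The coarse-then-fine idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Component`, `coarse_candidates`, `late_interaction_score`, and `retrieve` are hypothetical, the toy 2-d vectors stand in for encoder embeddings, and the fine-grained score uses a ColBERT-style MaxSim over subcomponents as an assumed stand-in for the paper's edge-level scoring, which is not specified in this summary.

```python
import math
from dataclasses import dataclass, field

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Component:
    """Hypothetical coarse-layer node with its fine-layer subcomponents."""
    name: str
    embedding: Vector                                           # coarse layer
    subcomponents: list[Vector] = field(default_factory=list)   # fine layer


def coarse_candidates(query: Vector, components: list[Component], k: int) -> list[Component]:
    """Stage 1: cheap candidate generation using coarse embeddings only."""
    ranked = sorted(components, key=lambda c: cosine(query, c.embedding), reverse=True)
    return ranked[:k]


def late_interaction_score(subqueries: list[Vector], component: Component) -> float:
    """Stage 2 (assumed MaxSim): each subquery takes its best-matching subcomponent."""
    return sum(
        max((cosine(q, s) for s in component.subcomponents), default=0.0)
        for q in subqueries
    )


def retrieve(query: Vector, subqueries: list[Vector],
             components: list[Component], k: int) -> list[Component]:
    """Generate k coarse candidates, then rerank them by fine-grained score."""
    candidates = coarse_candidates(query, components, k)
    return sorted(candidates,
                  key=lambda c: late_interaction_score(subqueries, c),
                  reverse=True)


# Toy data: A matches each subquery exactly at the fine layer,
# while B looks closer to the whole query at the coarse layer.
corpus = [
    Component("A", [1.0, 0.2], [[1.0, 0.0], [0.0, 1.0]]),
    Component("B", [1.0, 1.0], [[1.0, 1.0]]),
    Component("C", [-1.0, 0.0], [[-1.0, 0.0]]),
]
query = [1.0, 1.0]
subqueries = [[1.0, 0.0], [0.0, 1.0]]
results = [c.name for c in retrieve(query, subqueries, corpus, k=2)]
```

In this toy setup the coarse stage keeps B and A and drops C, and the fine-grained stage then promotes A above B because each subquery finds an exact match among A's subcomponents. This mirrors, in simplified form, how fine-grained scoring can reduce the influence of content that only looks relevant at coarse granularity.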

Paper Structure (34 sections, 13 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Challenges of TextRAG approaches and VisRAG approaches. (a) Incorrect summarization may result in possible information loss in TextRAG. (b) Insufficient retrieval granularity in VisRAG. (c) Limited multihop reasoning due to loss of links in VisRAG.
  • Figure 2: Overview of LILaC. (a) A layered component graph is constructed by organizing multimodal documents into coarse- and fine-grained layers. (b) The query is decomposed, followed by modality classification for each subquery. (c) LILaC dynamically retrieves a query-relevant subgraph through iterative beam-search traversal.
  • Figure 3: An example case of edge-level late interaction.
  • Figure 4: (a) Comparison of average algorithm execution times across different methods, and (b) detailed runtime breakdown of LILaC.
  • Figure 5: Change in retrieval accuracy with varying parameter values.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Definition 1: Layered Component Graph
  • Definition 2: Subcomponent