SAGE: Structure Aware Graph Expansion for Retrieval of Heterogeneous Data

Prasham Titiya, Rohit Khoja, Tomer Wolfson, Vivek Gupta, Dan Roth

TL;DR

This work proposes SAGE (Structure Aware Graph Expansion), a framework that constructs a chunk-level graph offline using metadata-driven similarities with percentile-based pruning. At query time, it runs an initial baseline retriever to obtain k seed chunks, expands to their first-hop neighbors, and filters those neighbors with dense+sparse retrieval to select k' additional chunks.

Abstract

Retrieval-augmented question answering over heterogeneous corpora requires connected evidence across text, tables, and graph nodes. While entity-level knowledge graphs support structured access, they are costly to construct and maintain, and inefficient to traverse at query time. In contrast, standard retriever-reader pipelines use flat similarity search over independently chunked text, missing multi-hop evidence chains across modalities. We propose the SAGE (Structure Aware Graph Expansion) framework, which (i) constructs a chunk-level graph offline using metadata-driven similarities with percentile-based pruning, and (ii) performs online retrieval by running an initial baseline retriever to obtain k seed chunks, expanding to first-hop neighbors, and then filtering the neighbors using dense+sparse retrieval to select k' additional chunks. We instantiate the initial retriever using hybrid dense+sparse retrieval for implicit cross-modal corpora, and SPARK (Structure Aware Planning Agent for Retrieval over Knowledge Graphs), an agentic retriever, for explicit schema graphs. On OTT-QA and STaRK, SAGE improves retrieval recall by 5.7 and 8.5 points over baselines.
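
The two stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The edge similarities, the `score` callable (standing in for the dense+sparse filter), the percentile value, and all function names are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Set, Tuple

Edge = Tuple[str, str, float]  # (chunk_a, chunk_b, similarity), hypothetical format


def percentile(values: List[float], pct: float) -> float:
    """Linear-interpolation percentile of a non-empty list, pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def build_graph(edges: List[Edge], pct: float) -> Dict[str, Set[str]]:
    """Offline step: keep only edges whose similarity reaches the pct-th percentile."""
    threshold = percentile([w for _, _, w in edges], pct)
    graph: Dict[str, Set[str]] = {}
    for a, b, w in edges:
        if w >= threshold:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def expand(
    seeds: List[str],
    graph: Dict[str, Set[str]],
    score: Callable[[str], float],
    k_prime: int,
) -> List[str]:
    """Online step: gather first-hop neighbors of the seeds, rank them by
    the (assumed) query-relevance score, and append the top k_prime."""
    seen = set(seeds)
    candidates: Set[str] = set()
    for s in seeds:
        candidates |= graph.get(s, set()) - seen
    # Sort by descending score; break ties by chunk id for determinism.
    ranked = sorted(candidates, key=lambda c: (-score(c), c))
    return list(seeds) + ranked[:k_prime]


# Toy data, purely illustrative.
edges = [("a", "b", 0.9), ("a", "c", 0.2), ("b", "d", 0.8),
         ("c", "d", 0.1), ("a", "e", 0.7)]
graph = build_graph(edges, pct=50)
toy_scores = {"b": 0.3, "e": 0.6}
context = expand(["a"], graph, lambda c: toy_scores.get(c, 0.0), k_prime=1)
print(context)
```

Here the seed list plays the role of the k chunks returned by the baseline retriever, and the final context holds k + k' chunks. Restricting expansion to first-hop neighbors keeps the candidate set small, so the filter only has to score a handful of chunks per query.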

Paper Structure (83 sections, 6 equations, 19 figures, 20 tables, 2 algorithms)


Figures (19)

  • Figure 1: Illustration of SAGE’s seed$\rightarrow$expand retrieval: starting from an initial relevant chunk, graph expansion retrieves a connected neighbor containing the missing movie/director evidence that flat retrieval fails to surface.
  • Figure 2: Overview of SAGE. Offline: we semantically chunk documents and tables into nodes and create edges using metadata-driven similarity (e.g., title, topic, content, entities) with percentile-based pruning. Online: a baseline retriever returns $k$ seed nodes, we expand to $n$ first-hop neighbors in the offline graph, and re-rank to select $k'$ additional nodes, yielding a final context of $k{+}k'$.
  • Figure 3: SKB baseline retrieval: the question and schema metadata guide an LLM agent to generate a retrieval plan interleaving HNSW semantic search with Cypher symbolic queries.
  • Figure 4: OTT-QA retrieval Recall@$k$ (%, higher is better). BL (Baseline) is a flat hybrid retriever combining BM25 and dense embeddings. BL+Graph retrieves $k_1$ seeds with BL and adds $k_2$ one-hop neighbors from the induced graph ($k=k_1+k_2$), while Extended BL retrieves $k$ items directly with BL. Arrows annotate the absolute gain at each $k$.
  • Figure 5: Left: column similarity distribution. Right: number of entities per table chunk (left-skewed).
  • ...and 14 more figures
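
The equal-budget comparison in the Figure 4 caption (BL+Graph with $k=k_1+k_2$ versus Extended BL retrieving $k$ items directly) can be made concrete with a small Recall@$k$ sketch. The definition below is one common form of the metric and may differ from the exact one used in the paper. The function name and the toy rankings are hypothetical, not results from the paper.

```python
from typing import List, Set


def recall_at_k(retrieved: List[str], gold: Set[str], k: int) -> float:
    """Fraction of gold items that appear among the top-k retrieved items."""
    if not gold:
        raise ValueError("gold set must be non-empty")
    return len(set(retrieved[:k]) & gold) / len(gold)


# Toy example with a budget of k = 4 (k1 = 2 seeds, k2 = 2 neighbors).
gold = {"g1", "g2"}
extended_bl = ["g1", "x1", "x2", "x3"]       # flat retriever, top-4
bl_graph = ["g1", "x1"] + ["g2", "x4"]       # 2 seeds + 2 one-hop neighbors

recall_extended = recall_at_k(extended_bl, gold, k=4)
recall_graph = recall_at_k(bl_graph, gold, k=4)
print(recall_extended, recall_graph)
```

Holding the total budget fixed is what makes the comparison fair: any gain for BL+Graph then comes from which items are selected, not from retrieving more of them.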