SAGE: Structure Aware Graph Expansion for Retrieval of Heterogeneous Data

Prasham Titiya; Rohit Khoja; Tomer Wolfson; Vivek Gupta; Dan Roth

TL;DR

This work proposes SAGE (Structure Aware Graph Expansion), a framework that constructs a chunk-level graph offline using metadata-driven similarities with percentile-based pruning. At query time, it runs an initial baseline retriever to obtain k seed chunks, expands to their first-hop neighbors, and filters those neighbors with dense+sparse retrieval to select k' additional chunks.

Abstract

Retrieval-augmented question answering over heterogeneous corpora requires connected evidence across text, tables, and graph nodes. While entity-level knowledge graphs support structured access, they are costly to construct and maintain, and inefficient to traverse at query time. In contrast, standard retriever-reader pipelines use flat similarity search over independently chunked text, missing multi-hop evidence chains across modalities. We propose SAGE (Structure Aware Graph Expansion), a framework that (i) constructs a chunk-level graph offline using metadata-driven similarities with percentile-based pruning, and (ii) performs online retrieval by running an initial baseline retriever to obtain k seed chunks, expanding to their first-hop neighbors, and then filtering the neighbors using dense+sparse retrieval to select k' additional chunks. We instantiate the initial retriever with hybrid dense+sparse retrieval for implicit cross-modal corpora, and with SPARK (Structure Aware Planning Agent for Retrieval over Knowledge Graphs), an agentic retriever, for explicit schema graphs. On OTT-QA and STaRK, SAGE improves retrieval recall over baselines by 5.7 and 8.5 points, respectively.
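
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy chunks, the Jaccard overlap over metadata tags, the nearest-rank percentile cutoff, and the token-overlap scorer (a stand-in for the hybrid dense+sparse retriever) are all assumptions chosen to keep the example self-contained, and the names `build_graph`, `retrieve`, and `overlap_score` are hypothetical.

```python
import math
from itertools import combinations

# Toy corpus (hypothetical): each chunk has text and a set of metadata tags.
CHUNKS = {
    "c1": {"text": "inception is a 2010 film", "meta": {"inception", "film"}},
    "c2": {"text": "christopher nolan is a british filmmaker", "meta": {"inception", "nolan"}},
    "c3": {"text": "paris is the capital of france", "meta": {"paris", "france"}},
    "c4": {"text": "film festivals in france", "meta": {"film", "france", "festival"}},
}

def jaccard(a, b):
    """Assumed metadata similarity: Jaccard overlap of two tag sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def build_graph(chunks, q=90.0):
    """Offline stage: score all chunk pairs, keep edges at or above the q-th percentile."""
    scores = {
        (a, b): jaccard(chunks[a]["meta"], chunks[b]["meta"])
        for a, b in combinations(sorted(chunks), 2)
    }
    cutoff = percentile(scores.values(), q)
    graph = {cid: set() for cid in chunks}
    for (a, b), s in scores.items():
        if s > 0 and s >= cutoff:  # percentile-based pruning
            graph[a].add(b)
            graph[b].add(a)
    return graph

def overlap_score(query_terms, chunk):
    """Stand-in for dense+sparse scoring: count of shared tokens."""
    return len(query_terms & set(chunk["text"].split()))

def rank(ids, query_terms, chunks):
    """Order chunk ids by descending score, breaking ties by id."""
    return sorted(ids, key=lambda c: (-overlap_score(query_terms, chunks[c]), c))

def retrieve(query_terms, chunks, graph, k=1, k_prime=1):
    """Online stage: k seeds, first-hop expansion, then keep the top k' neighbors."""
    seeds = rank(chunks, query_terms, chunks)[:k]
    neighbors = {n for s in seeds for n in graph[s]} - set(seeds)
    extra = rank(neighbors, query_terms, chunks)[:k_prime]
    return seeds + extra

graph = build_graph(CHUNKS, q=90.0)
result = retrieve({"directed", "2010", "film"}, CHUNKS, graph, k=1, k_prime=1)
print(result)
```

In this toy setup the director chunk `c2` shares no tokens with the query, so the flat scorer alone would rank it below `c4`; it enters the context only because it is a first-hop neighbor of the seed `c1`. A real system would replace `overlap_score` with actual BM25 and embedding scores and would compute the similarities over richer metadata.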

Paper Structure (83 sections, 6 equations, 19 figures, 20 tables, 2 algorithms)

Introduction
SAGE Approach
Offline Graph Construction
A. Data Processing and Node Creation
1. Semantic chunking of documents.
2. Table segmentation.
3. Semi-structured nodes.
B. Edge creation
Graph construction via metadata similarities.
Edge metadata for traversal.
Online Retrieval
A. Initial baseline retrieval.
1. Similarity graphs
2. Semi-Structured Knowledge Bases (SKBs)
B. Graph-based neighbor expansion and pruning.
...and 68 more sections

Figures (19)

Figure 1: Illustration of SAGE’s seed$\rightarrow$expand retrieval: starting from an initial relevant chunk, graph expansion retrieves a connected neighbor containing the missing movie/director evidence that flat retrieval fails to surface.
Figure 2: Overview of SAGE. Offline: we semantically chunk documents and tables into nodes and create edges using metadata-driven similarity (e.g., title, topic, content, entities) with percentile-based pruning. Online: a baseline retriever returns $k$ seed nodes, we expand to $n$ first-hop neighbors in the offline graph, and re-rank to select $k'$ additional nodes, yielding a final context of $k{+}k'$.
Figure 3: SKB baseline retrieval: the question and schema metadata guide an LLM agent to generate a retrieval plan interleaving HNSW semantic search with Cypher symbolic queries.
Figure 4: OTT-QA retrieval Recall@$k$ (%, higher is better). BL (Baseline) is a flat hybrid retriever combining BM25 and dense embeddings. BL+Graph retrieves $k_1$ seeds with BL and adds $k_2$ one-hop neighbors from the induced graph ($k=k_1+k_2$), while Extended BL retrieves $k$ items directly with BL. Arrows annotate the absolute gain at each $k$.
Figure 5: Left: column similarity distribution. Right: number of entities per table chunk (left-skewed).
...and 14 more figures
