PropRAG: Guiding Retrieval with Beam Search over Proposition Paths

Jingjin Wang, Jiawei Han

TL;DR

PropRAG tackles the limitations of traditional RAG by adopting context-rich propositions as knowledge units and introducing an offline proposition graph with an online, LLM-free beam search to discover multi-hop reasoning paths. The framework uses a two-stage retrieval strategy (Stage 1: coarse subgraph induction via Personalized PageRank (PPR); Stage 2: beam-search path discovery and ranking) to efficiently assemble coherent evidence chains without online LLM calls. Empirical results on MuSiQue, 2Wiki, and HotpotQA show state-of-the-art zero-shot Recall@5 and F1, with ablations confirming the value of propositions, graph guidance, and beam search. Efficiency analysis indicates a favorable offline-online trade-off: higher upfront proposition extraction costs yield significantly better retrieval quality while avoiding costly online LLM inference during retrieval, enabling practical multi-hop evidence gathering for LLMs.

Abstract

Retrieval Augmented Generation (RAG) has become the standard approach for equipping Large Language Models (LLMs) with up-to-date knowledge. However, standard RAG, relying on independent passage retrieval, often fails to capture the interconnected nature of information required for complex, multi-hop reasoning. While structured RAG methods attempt to address this using knowledge graphs built from triples, we argue that the inherent context loss of triples (context collapse) limits the fidelity of the knowledge representation. We introduce PropRAG, a novel RAG framework that shifts from triples to context-rich propositions and introduces an efficient, LLM-free online beam search over proposition paths to discover multi-step reasoning chains. By coupling a higher-fidelity knowledge representation with explicit path discovery, PropRAG achieves state-of-the-art zero-shot Recall@5 and F1 scores on 2Wiki, HotpotQA, and MuSiQue, advancing non-parametric knowledge integration by improving evidence retrieval through richer representation and efficient reasoning path discovery.
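
To make the idea of an LLM-free beam search over proposition paths concrete, the following is a minimal sketch, not the authors' implementation. The names `beam_search_paths`, `props`, `neighbors`, and `score_path` are hypothetical, the propositions are toy data, and the token-overlap scorer is only a stand-in: the text above states that the search runs without online LLM calls but does not specify the scoring function.

```python
from typing import Callable, Dict, List, Set, Tuple

Path = Tuple[str, ...]

def beam_search_paths(
    start_props: List[str],
    neighbors: Dict[str, Set[str]],
    score_path: Callable[[Path], float],
    beam_width: int = 2,
    max_len: int = 3,
) -> List[Tuple[float, Path]]:
    """Return the best-scoring proposition paths with at most max_len hops."""
    # Seed the beam with single-proposition paths.
    beam = sorted(((score_path((p,)), (p,)) for p in start_props), reverse=True)
    beam = beam[:beam_width]
    best = list(beam)
    for _ in range(max_len - 1):
        candidates = []
        for _, path in beam:
            for nxt in neighbors.get(path[-1], set()):
                if nxt in path:  # do not revisit a proposition
                    continue
                new_path = path + (nxt,)
                candidates.append((score_path(new_path), new_path))
        if not candidates:
            break
        # Keep only the beam_width best extensions for the next step.
        beam = sorted(candidates, reverse=True)[:beam_width]
        best.extend(beam)
    return sorted(best, reverse=True)[:beam_width]

# Toy propositions (hypothetical data, not taken from the paper).
props = {
    "p1": "Alice founded Acme",
    "p2": "Acme is headquartered in Berlin",
    "p3": "Berlin is the capital of Germany",
    "p4": "Alice likes tea",
}
# Two propositions are neighbors when they share an entity.
neighbors = {
    "p1": {"p2", "p4"},
    "p2": {"p1", "p3"},
    "p3": {"p2"},
    "p4": {"p1"},
}
query_terms = {"alice", "founded", "headquartered", "germany"}

def score_path(path: Path) -> float:
    # Stand-in scorer: number of query terms covered by the path's text.
    tokens = set()
    for pid in path:
        tokens.update(props[pid].lower().split())
    return len(query_terms & tokens)

result = beam_search_paths(["p1", "p4"], neighbors, score_path)
```

In this toy run the three-hop chain `p1 -> p2 -> p3` covers all four query terms and ranks first, which illustrates why scoring whole paths rather than isolated propositions can surface bridging evidence that no single proposition would retrieve.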

Paper Structure

This paper contains 24 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparison of a traditional Knowledge Graph (KG) versus a Proposition Graph for a complex passage. Left: The triple-based KG struggles to natively represent provenance ("archival records") and conditional clauses. It results in disconnected facts where the crucial context (the conditionality of the Emancipation Proclamation taking effect) is omitted, leading to context loss. Right: The PropRAG proposition graph utilizes implicit hyper-edges (fully connected cliques within shaded ovals) to link all entities co-occurring within a single, context-rich proposition. This structure directly preserves nuances like conditionality and provenance.
  • Figure 2: The two-stage online retrieval architecture of PropRAG. Stage 1 (Coarse Filtering): Employs exploratory PPR (high damping factor) on the full proposition graph $G$ to induce a focused, relevant subgraph ($G_{sub}$). Stage 2 (Fine Reasoning): Executes a graph-guided beam search on $G_{sub}$ to discover explicit reasoning paths (illustrated in Figure 3), generates refined relevance signals based on these paths, applies exploitative PPR (low damping factor) on $G_{sub}$ using the refined signals, and selects the final top-$k_{out}$ evidence passages.
  • Figure 3: Running Example: Beam search execution ($L_{max}=3$) for a MuSiQue query, illustrating the discovery of a multi-hop reasoning path in PropRAG. Proposition text is abridged for clarity.
  • Figure 4: LLM prompt for Entity Extraction. This prompt aims for comprehensive entity identification beyond standard NER.
  • Figure 5: LLM prompt for Proposition Extraction. This prompt emphasizes contextual completeness and adherence to pre-identified entities.
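
The clique structure described in the Figure 1 caption and the two PPR settings in the Figure 2 caption can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's code: the function names `build_entity_graph` and `personalized_pagerank` are hypothetical, the graph contains entity nodes only, the damping values 0.85 and 0.3 are illustrative, and damping is taken to mean the probability of following an edge rather than restarting at a seed.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set

def build_entity_graph(propositions: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Link every pair of entities that co-occur in one proposition (a clique)."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for entities in propositions.values():
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)

def personalized_pagerank(
    graph: Dict[str, Set[str]],
    seeds: Iterable[str],
    damping: float,
    iters: int = 100,
) -> Dict[str, float]:
    """Power-iteration PPR; `damping` is the probability of following an edge."""
    seed_set = set(seeds)
    restart = {n: (1.0 / len(seed_set) if n in seed_set else 0.0) for n in graph}
    rank = dict(restart)
    for _ in range(iters):
        # Restart mass goes back to the seeds; the rest flows along edges.
        nxt = {n: (1.0 - damping) * restart[n] for n in graph}
        for node, nbrs in graph.items():
            share = damping * rank[node] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        rank = nxt
    return rank

# Toy input (hypothetical): proposition id -> entities mentioned in it.
propositions = {"prop_1": ["A", "B", "C"], "prop_2": ["C", "D"]}
graph = build_entity_graph(propositions)

# High damping spreads mass widely (exploratory);
# low damping keeps it near the seeds (exploitative).
explore = personalized_pagerank(graph, seeds=["A"], damping=0.85)
exploit = personalized_pagerank(graph, seeds=["A"], damping=0.3)
```

Under these assumptions, `exploit["A"]` is larger than `explore["A"]`, matching the captions' contrast between a high-damping pass that explores the full graph and a low-damping pass that concentrates on refined signals. Inducing a subgraph would then amount to keeping the top-ranked nodes of the exploratory pass.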