StepChain GraphRAG: Reasoning Over Knowledge Graphs for Multi-Hop Question Answering

Tengjun Ni, Xin Yuan, Shenghong Li, Kai Wu, Ren Ping Liu, Wei Ni, Wenjie Zhang

TL;DR

StepChain GraphRAG tackles multi-hop QA by combining question decomposition with a BFS-based reasoning flow and incremental knowledge-graph maintenance, producing targeted, verifiable evidence chains at each reasoning step. The framework builds a dynamic knowledge graph from retrieved passages parsed on the fly, updates it after each sub-question, and merges partial answers with community summaries into a grounded final response. Results on MuSiQue, 2WikiMultiHopQA, and HotpotQA show state-of-the-art EM and F1, with the largest improvements on HotpotQA, and ablations support the contribution of decomposition, graph retrieval, and reasoning pathways. The approach incurs higher computational overhead from graph construction and LLM usage, which motivates future work on efficiency, uncertainty handling, and reducing hallucinations in graph-guided multi-hop QA.

Abstract

Recent progress in retrieval-augmented generation (RAG) has led to more accurate and interpretable multi-hop question answering (QA). Yet, challenges persist in integrating iterative reasoning steps with external knowledge retrieval. To address this, we introduce StepChain GraphRAG, a framework that unites question decomposition with a Breadth-First Search (BFS) Reasoning Flow for enhanced multi-hop QA. Our approach first builds a global index over the corpus; at inference time, only retrieved passages are parsed on-the-fly into a knowledge graph, and the complex query is split into sub-questions. For each sub-question, a BFS-based traversal dynamically expands along relevant edges, assembling explicit evidence chains without overwhelming the language model with superfluous context. Experiments on MuSiQue, 2WikiMultiHopQA, and HotpotQA show that StepChain GraphRAG achieves state-of-the-art Exact Match and F1 scores. StepChain GraphRAG lifts average EM by 2.57% and F1 by 2.13% over the SOTA method, achieving the largest gain on HotpotQA (+4.70% EM, +3.44% F1). StepChain GraphRAG also fosters enhanced explainability by preserving the chain-of-thought across intermediate retrieval steps. We conclude by discussing how future work can mitigate the computational overhead and address potential hallucinations from large language models to refine efficiency and reliability in multi-hop QA.
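The abstract reports results as Exact Match (EM) and F1. As a minimal sketch of how these two metrics are commonly computed for QA, the snippet below uses the widely adopted SQuAD-style answer normalization and token-overlap F1. This is an assumption for illustration: the abstract does not specify the evaluation script, and the function names `normalize`, `exact_match`, and `f1_score` are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed SQuAD-style normalization; the paper's own script may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "Neville" against the gold answer "Neville Longbottom" scores 0 on EM but 2/3 on F1 (precision 1, recall 1/2), which is why the two metrics are usually reported together.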

Paper Structure

This paper contains 20 sections, 12 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Illustration of StepChain GraphRAG applied to a Harry Potter example. On the left, a partial knowledge graph encodes entities (e.g., Voldemort, Nagini) and relationships ("create," "destroy") derived via entity extraction and linking. On the right, our system decomposes the user's query, "Who destroys the last Horcrux of Voldemort?", into sub-questions about (1) how Horcruxes relate to Voldemort, (2) which Horcrux is the final one, and (3) who destroys it. A BFS-based graph traversal gathers multi-hop evidence chains (e.g., "Voldemort $\to$ creates $\to$ Horcruxes," "Nagini $\to$ final Horcrux," "Neville $\to$ destroys $\to$ Nagini"), and partial answers from each sub-question are merged to form the conclusion: Neville is the one who destroys Voldemort's last Horcrux.
  • Figure 2: An overview of the StepChain GraphRAG pipeline. First, the corpus is split into chunks; retrieved chunks are parsed on-the-fly to extract entities and relations, which are upserted into a knowledge graph. Next, a complex question is decomposed into multiple sub-questions, each answered via the BFS Reasoning Flow (BFS-RF), which traverses the graph to find relevant entities and relations. The discovered evidence chains are combined to yield partial answers for each sub-question. Finally, these partial answers are synthesized by the LLM to produce the final, fully grounded response.
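
The BFS-based traversal described in the two captions can be sketched on the Harry Potter example from Figure 1. This is a toy illustration under stated assumptions, not the authors' implementation: the triples are hand-written (the "include" edge is added only to connect the graph), the names `build_graph` and `bfs_evidence` are hypothetical, and the relevance check is a plain callback standing in for whatever scoring the actual system uses.

```python
from collections import deque

# Hand-written toy triples loosely following Figure 1 (illustrative only).
TRIPLES = [
    ("Voldemort", "creates", "Horcruxes"),
    ("Horcruxes", "include", "Nagini"),  # assumed edge, added for connectivity
    ("Nagini", "is", "final Horcrux"),
    ("Neville", "destroys", "Nagini"),
]


def build_graph(triples):
    """Index each triple under both endpoints so edges can be followed either way."""
    graph = {}
    for head, rel, tail in triples:
        graph.setdefault(head, []).append((head, rel, tail))
        graph.setdefault(tail, []).append((head, rel, tail))
    return graph


def bfs_evidence(graph, seeds, max_hops=2, is_relevant=lambda triple: True):
    """Collect relevant triples within max_hops of the seed entities, in BFS order."""
    seen_nodes = set(seeds)
    evidence = []
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop budget
        for triple in graph.get(node, []):
            if not is_relevant(triple):
                continue  # skip edges judged irrelevant to the sub-question
            if triple not in evidence:
                evidence.append(triple)
            head, _, tail = triple
            neighbor = tail if head == node else head
            if neighbor not in seen_nodes:
                seen_nodes.add(neighbor)
                queue.append((neighbor, depth + 1))
    return evidence


graph = build_graph(TRIPLES)
for head, rel, tail in bfs_evidence(graph, ["Voldemort"], max_hops=3):
    print(f"{head} -> {rel} -> {tail}")
```

Starting from "Voldemort" with a budget of three hops, the traversal reaches the "Neville destroys Nagini" edge, which is the evidence needed for the final sub-question. Bounding the depth and filtering edges is what keeps the context handed to the language model small.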