AGRAG: Advanced Graph-based Retrieval-Augmented Generation for LLMs

Yubo Wang, Haoyang Li, Fei Teng, Lei Chen

TL;DR

AGRAG tackles three core shortcomings of graph-based RAG: inaccurate graph construction, opaque reasoning, and incomplete answers. It replaces LLM-driven entity extraction with a TF-IDF-based approach, constructs a weighted KG, and generates a Minimum Cost Maximum Influence (MCMI) subgraph that gives the LLM explicit reasoning paths, including cycles. A greedy 2-approximation solves the NP-hard MCMI generation problem, enabling more comprehensive, query-focused reasoning than Steiner-tree baselines. Empirical results across six tasks show that AGRAG achieves superior effectiveness and efficiency compared with state-of-the-art RAG models, particularly on complex summarization and reasoning tasks, with robust parameter settings and informative ablations. This work advances practical RAG by combining explicit reasoning graphs with hybrid retrieval to improve grounding, faithfulness, and scalability in real-world LLM-assisted QA and generation tasks.

Abstract

Graph-based retrieval-augmented generation (Graph-based RAG) has demonstrated significant potential in enhancing Large Language Models (LLMs) with structured knowledge. However, existing methods face three critical challenges: Inaccurate Graph Construction, caused by LLM hallucination; Poor Reasoning Ability, caused by the failure to generate explicit reasons that tell the LLM why certain chunks were selected; and Inadequate Answering, in which the query is only partially answered because of inadequate LLM reasoning, so that their performance lags behind NaiveRAG on certain tasks. To address these issues, we propose AGRAG, an advanced graph-based retrieval-augmented generation framework. When constructing the graph, AGRAG replaces the widely used LLM entity extraction method with a statistics-based method, avoiding hallucination and error propagation. During retrieval, AGRAG formulates the graph reasoning procedure as the Minimum Cost Maximum Influence (MCMI) subgraph generation problem, in which we try to include more nodes with high influence scores while incurring less edge cost, so that the generated reasoning paths are more comprehensive. We prove this problem to be NP-hard and propose a greedy algorithm to solve it. The generated MCMI subgraph serves as explicit reasoning paths that tell the LLM why certain chunks were retrieved, helping the LLM focus on the query-related content of the chunks, reducing the impact of noise, and improving AGRAG's reasoning ability. Furthermore, compared with simple tree-structured reasoning paths, our MCMI subgraph allows more complex graph structures, such as cycles, and improves the comprehensiveness of the generated reasoning paths.
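
The statistics-based entity extraction mentioned above can be illustrated with a minimal sketch. It assumes whitespace tokenization, a plain TF-IDF score over word n-grams, and a fixed score threshold; the function names `ngrams` and `extract_entities` and the parameters `max_n` and `threshold` are hypothetical and are not taken from the paper, whose exact scoring and filtering may differ.

```python
import math
from collections import Counter

def ngrams(tokens, max_n):
    """Yield every contiguous n-gram (joined by spaces) for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def extract_entities(chunks, max_n=2, threshold=0.1):
    """Return, per chunk, a dict of n-grams whose TF-IDF score reaches `threshold`.

    Assumption: tf = count / total n-grams in the chunk, idf = log(N / df).
    This is a textbook TF-IDF variant, not necessarily the paper's formula.
    """
    tokenized = [chunk.lower().split() for chunk in chunks]
    counts = [Counter(ngrams(tokens, max_n)) for tokens in tokenized]

    # Document frequency: in how many chunks does each n-gram occur?
    doc_freq = Counter()
    for chunk_counts in counts:
        doc_freq.update(chunk_counts.keys())

    n_docs = len(chunks)
    entities = []
    for chunk_counts in counts:
        total = sum(chunk_counts.values()) or 1
        kept = {}
        for term, freq in chunk_counts.items():
            score = (freq / total) * math.log(n_docs / doc_freq[term])
            if score >= threshold:
                kept[term] = score
        entities.append(kept)
    return entities
```

In this sketch, terms that occur in every chunk receive an IDF of zero and are dropped, so only chunk-specific n-grams survive as entity candidates. No LLM call is involved, which is the property the abstract relies on to avoid hallucinated entities.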

Paper Structure

This paper contains 29 sections, 2 theorems, 19 equations, 5 figures, 6 tables, and 3 algorithms.

Key Result

Theorem 1

The Minimum Cost Maximum Influence Subgraph Generation problem is NP-hard.

Figures (5)

  • Figure 1: An overview of Graph-based RAG models. They first extract entities and relations as graph nodes and edges. During retrieval, they map the query to graph nodes and use the graph structure to support query answering.
  • Figure 2: An overview of AGRAG. In Step 1, AGRAG constructs a KG from the text corpus. In Step 2, the query is mapped to relevant triplet facts in the KG; node scores are then assigned by the PPR algorithm, and edge weights by the semantic similarity between the query and each fact. In Step 3, AGRAG generates an MCMI subgraph, providing explicit reasoning chains in addition to the most semantically similar text chunks retrieved by Hybrid Retrieval (HR) to support LLM query answering.
  • Figure 3: An example of a generated MCMI subgraph. Red triples are those mapped from the query. Our algorithm first constructs the MCMI subgraph from a Steiner tree rooted at these mapped triples, then iteratively expands it by adding the neighboring triple with the highest influence-cost ratio $s/c$. Expansion continues until no neighboring triple has a ratio greater than the average influence-cost ratio of the current graph, so that more query-related triples (in orange) are covered.
  • Figure 4: Parameter sensitivity experiment on GraphRAG-bench, where the x-axis of each heatmap denotes the entity extraction threshold and the y-axis denotes the maximum n-gram length. We choose accuracy as the metric; the darker the color of a block, the better the accuracy of the corresponding parameter pair.
  • Figure 5: Two cases of AGRAG w.r.t. its statistics-based entity extraction method and MCMI generation step from GraphRAG-bench's novel dataset.
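
The expansion loop described in the Figure 3 caption can be sketched as follows. This is only an illustration under stated assumptions: the seed (a Steiner tree in the paper) is passed in precomputed, each triple carries a single influence score `s` and cost `c`, and the "average influence-cost ratio of the current graph" is taken to be total influence divided by total cost. The names `Triple` and `expand_mcmi` are hypothetical, and the paper's actual algorithm may define these quantities differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str
    influence: float  # s: assumed per-triple influence score
    cost: float       # c: assumed per-triple cost, must be > 0

    @property
    def ratio(self):
        """Influence-cost ratio s / c of this triple."""
        return self.influence / self.cost

def expand_mcmi(all_triples, seed):
    """Greedily grow `seed` with neighboring triples of high s / c.

    Stops once no neighboring triple has a ratio above the current
    subgraph's average ratio (assumed: total influence / total cost).
    """
    selected = list(seed)
    remaining = [t for t in all_triples if t not in selected]
    while remaining:
        # A triple is a neighbor if it shares a node with the subgraph.
        nodes = {n for t in selected for n in (t.head, t.tail)}
        neighbors = [t for t in remaining if t.head in nodes or t.tail in nodes]
        if not neighbors:
            break
        best = max(neighbors, key=lambda t: t.ratio)
        average = sum(t.influence for t in selected) / sum(t.cost for t in selected)
        if best.ratio <= average:
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because a candidate only needs to touch the current subgraph, it may connect two nodes that are already included. This is how the sketch can close cycles, which a tree-structured reasoning path cannot represent.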

Theorems & Definitions (5)

  • Definition 1: Minimum Cost Maximum Influence Subgraph Generation Problem
  • Theorem 1
  • proof
  • Theorem 2
  • proof