NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes

Tianyang Xu, Haojie Zheng, Chengze Li, Haoxiang Chen, Yixin Liu, Ruoxi Chen, Lichao Sun

TL;DR

NodeRAG addresses the limitations of graph-based RAG by introducing a heterogeneous graph (heterograph) with seven node types that unify entities, relationships, semantic units, and higher-level insights. The framework builds the graph through decomposition, augmentation, and enrichment, and retrieves from it with a dual-search step followed by shallow Personalized PageRank (PPR), supported by selective embeddings and HNSW edges. Empirical results show that NodeRAG achieves higher accuracy with fewer retrieval tokens across multi-hop benchmarks and Arena evaluations, outperforming GraphRAG, LightRAG, and NaiveRAG while also offering better indexing and query efficiency. The work underscores the importance of graph structure design in RAG and demonstrates a cohesive, end-to-end workflow that tightly couples LLM capabilities with graph algorithms for fine-grained, explainable retrieval in complex corpora.
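To make the "shallow PPR" idea concrete, the following is a minimal sketch of Personalized PageRank run for only a couple of power-iteration steps from a set of entry-point nodes. It is an illustration under stated assumptions, not the authors' implementation: the function name shallow_ppr, the node labels, the toy edges, and the values alpha=0.5 and iterations=2 are all hypothetical choices made for this example.

```python
from collections import defaultdict


def shallow_ppr(edges, seeds, alpha=0.5, iterations=2):
    """Personalized PageRank by power iteration, stopped after a few steps.

    edges: iterable of undirected (u, v) pairs (hypothetical toy graph)
    seeds: entry-point nodes that receive the restart mass
    alpha: probability of following an edge; 1 - alpha restarts at a seed
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Restart distribution: uniform over the entry points.
    restart = {s: 1.0 / len(seeds) for s in seeds}
    scores = dict(restart)

    for _ in range(iterations):
        spread = defaultdict(float)
        for node, mass in scores.items():
            nbrs = neighbors[node]
            if not nbrs:
                spread[node] += mass  # an isolated node keeps its mass
                continue
            share = mass / len(nbrs)
            for nb in nbrs:
                spread[nb] += share
        nodes = set(spread) | set(restart)
        scores = {
            n: alpha * spread[n] + (1 - alpha) * restart.get(n, 0.0)
            for n in nodes
        }
    return scores


# Hypothetical chain of entity and semantic-unit nodes.
edges = [
    ("entity:Alice", "unit:1"),
    ("unit:1", "entity:Bob"),
    ("entity:Bob", "unit:2"),
    ("unit:2", "entity:Carol"),
]
scores = shallow_ppr(edges, seeds=["entity:Alice"])
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because each iteration moves score by one hop, stopping after two iterations means only nodes within two hops of an entry point receive any score; in this toy graph, unit:2 and entity:Carol are never reached. That locality is the intuition behind keeping the iteration count small.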

Abstract

Retrieval-augmented generation (RAG) empowers large language models to access external and private corpora, enabling factually consistent responses in specific domains. By exploiting the inherent structure of the corpus, graph-based RAG methods further enrich this process by building a knowledge graph index and leveraging the structural nature of graphs. However, current graph-based RAG approaches seldom prioritize the design of graph structures. Inadequately designed graphs not only impede the seamless integration of diverse graph algorithms but also result in workflow inconsistencies and degraded performance. To further unleash the potential of graphs for RAG, we propose NodeRAG, a graph-centric framework introducing heterogeneous graph structures that enable the seamless and holistic integration of graph-based methodologies into the RAG workflow. By aligning closely with the capabilities of LLMs, this framework ensures a fully cohesive and efficient end-to-end process. Through extensive experiments, we demonstrate that NodeRAG exhibits performance advantages over previous methods, including GraphRAG and LightRAG, not only in indexing time, query time, and storage efficiency but also in delivering superior question-answering performance on multi-hop benchmarks and open-ended head-to-head evaluations with minimal retrieval tokens. Our GitHub repository is available at https://github.com/Terry-Xu-666/NodeRAG.

Paper Structure

This paper contains 47 sections, 16 equations, 10 figures, 8 tables, 2 algorithms.

Figures (10)

  • Figure 1: Comparisons between NodeRAG and other RAG systems. NaïveRAG retrieves fragmented text chunks, leading to redundant information. HippoRAG introduces knowledge graphs but lacks high-level summarization. GraphRAG retrieves community summaries but may still produce coarse-grained information. LightRAG incorporates one-hop neighbors but retrieves redundant nodes. In contrast, NodeRAG utilizes multiple node types, including high-level elements, semantic units, and relationships, enabling more precise, hierarchical retrieval while reducing irrelevant information.
  • Figure 2: Main indexing workflow of NodeRAG. It illustrates the step-by-step construction of the heterograph, including the processes of graph decomposition, graph augmentation, and graph enrichment.
  • Figure 3: This figure focuses on the querying process, where entry points are extracted from the original query, followed by searching for related nodes that need to be retrieved in the heterograph.
  • Figure 4: Ablation analysis on PPR iterations.
  • Figure :
  • ...and 5 more figures