NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes
Tianyang Xu, Haojie Zheng, Chengze Li, Haoxiang Chen, Yixin Liu, Ruoxi Chen, Lichao Sun
TL;DR
NodeRAG addresses the limitations of graph-based RAG by introducing a heterogeneous graph (heterograph) with seven node types that unify entities, relationships, semantic units, and higher-level insights. The framework builds the graph through decomposition, augmentation, and enrichment, and retrieves via a dual-search and shallow Personalized PageRank (PPR) strategy, augmented with selective embeddings and HNSW edges. Empirical results show that NodeRAG achieves higher accuracy with fewer retrieval tokens on multi-hop benchmarks and Arena evaluations, outperforming GraphRAG, LightRAG, and NaiveRAG while also improving indexing and query efficiency. The work underscores the importance of graph structure design in RAG and demonstrates a cohesive, end-to-end workflow that tightly couples LLM capabilities with graph algorithms for fine-grained, explainable retrieval in complex corpora.
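To make the "shallow PPR" idea concrete, here is a minimal sketch of Personalized PageRank truncated after a few iterations on a tiny typed graph. It is an illustration under stated assumptions, not the NodeRAG implementation: the function name `shallow_ppr`, the `alpha` and `iterations` defaults, the `(type, name)` node identifiers, and the toy edges are all hypothetical, and only three of the seven node types are represented.

```python
from collections import defaultdict


def shallow_ppr(edges, seeds, alpha=0.5, iterations=2):
    """Personalized PageRank stopped after a small, fixed number of iterations.

    edges: iterable of undirected (node, node) pairs.
    seeds: entry-point nodes (assumed here to come from an earlier search step).
    alpha: restart probability (illustrative default, not taken from the paper).
    """
    if not seeds:
        raise ValueError("at least one seed node is required")

    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Restart distribution: uniform over the seed nodes.
    restart = {node: 1.0 / len(seeds) for node in seeds}
    scores = dict(restart)

    for _ in range(iterations):
        spread = defaultdict(float)
        for node, score in scores.items():
            nbrs = neighbors.get(node)
            if not nbrs:
                # Isolated node: keep its mass in place so the total stays 1.
                spread[node] += score
                continue
            share = score / len(nbrs)
            for nbr in nbrs:
                spread[nbr] += share
        # Mix the restart distribution with the propagated mass.
        touched = set(spread) | set(restart)
        scores = {
            n: alpha * restart.get(n, 0.0) + (1 - alpha) * spread.get(n, 0.0)
            for n in touched
        }
    return scores


# Hypothetical toy heterograph: nodes are (type, name) pairs.
hinton = ("entity", "Hinton")
backprop = ("entity", "backprop")
turing = ("entity", "Turing")
rel = ("relationship", "Hinton popularized backprop")
s1, s2, s3 = ("semantic_unit", "S1"), ("semantic_unit", "S2"), ("semantic_unit", "S3")

edges = [
    (hinton, s1), (backprop, s1),
    (hinton, rel), (backprop, rel),
    (backprop, s2),
    (turing, s3),
]

scores = shallow_ppr(edges, seeds=[hinton])
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because the walk stops after two iterations, only nodes within two hops of the seed receive a score; `S2` and the unrelated `Turing`/`S3` component are never touched. This locality is what keeps a shallow variant cheap compared with running PageRank to convergence.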
Abstract
Retrieval-augmented generation (RAG) empowers large language models to access external and private corpora, enabling factually consistent responses in specific domains. By exploiting the inherent structure of the corpus, graph-based RAG methods further enrich this process by building a knowledge graph index and leveraging the structural nature of graphs. However, current graph-based RAG approaches seldom prioritize the design of graph structures. Inadequately designed graphs not only impede the seamless integration of diverse graph algorithms but also result in workflow inconsistencies and degraded performance. To further unleash the potential of graphs for RAG, we propose NodeRAG, a graph-centric framework introducing heterogeneous graph structures that enable the seamless and holistic integration of graph-based methodologies into the RAG workflow. By aligning closely with the capabilities of LLMs, this framework ensures a fully cohesive and efficient end-to-end process. Through extensive experiments, we demonstrate that NodeRAG exhibits performance advantages over previous methods, including GraphRAG and LightRAG, not only in indexing time, query time, and storage efficiency but also in delivering superior question-answering performance on multi-hop benchmarks and open-ended head-to-head evaluations with minimal retrieval tokens. Our GitHub repository is available at https://github.com/Terry-Xu-666/NodeRAG.
