RAG vs. GraphRAG: A Systematic Evaluation and Key Insights

Haoyu Han, Li Ma, Harry Shomer, Yu Wang, Yongjia Lei, Kai Guo, Zhigang Hua, Bo Long, Hui Liu, Charu C. Aggarwal, Jiliang Tang

TL;DR

This work provides the first unified, systematic evaluation of Retrieval-Augmented Generation (RAG) and Graph Retrieval-Augmented Generation (GraphRAG) on general text-based tasks, specifically Question Answering and Query-based Summarization. It finds that RAG performs better on single-hop, detail-rich retrieval, while GraphRAG, particularly Community-GraphRAG Local, performs better on multi-hop reasoning and structured summarization, so the two approaches have complementary strengths. The authors introduce two hybrid retrieval strategies, Selection and Integration, that combine both approaches and improve QA performance by up to 6.4 percentage points, with different trade-offs in cost and latency. They also analyze limitations such as incomplete graph coverage and evaluation biases, and outline future directions for graph construction, retrieval methods, and unbiased evaluation in GraphRAG systems.
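The summary names the two hybrid strategies without describing how they work, so the following is only a minimal sketch under stated assumptions: Selection is assumed to route each query to exactly one retriever, and Integration is assumed to merge the results of both. Every name here (`select`, `integrate`, `toy_rag`, `toy_graph_rag`, `toy_needs_multi_hop`) is hypothetical, and the routing rule is a stand-in heuristic, not the authors' method.

```python
from typing import Callable, List

# A retriever maps a query string to a list of retrieved text snippets.
Retriever = Callable[[str], List[str]]


def select(query: str, rag: Retriever, graph_rag: Retriever,
           needs_multi_hop: Callable[[str], bool]) -> List[str]:
    """Selection (assumed): run only one retriever per query."""
    return graph_rag(query) if needs_multi_hop(query) else rag(query)


def integrate(query: str, rag: Retriever, graph_rag: Retriever) -> List[str]:
    """Integration (assumed): run both retrievers and merge their results."""
    merged: List[str] = []
    seen = set()
    for snippet in rag(query) + graph_rag(query):
        if snippet not in seen:  # drop exact duplicates, keep first-seen order
            seen.add(snippet)
            merged.append(snippet)
    return merged


# Hypothetical stand-ins so the sketch runs without any external service.
def toy_rag(query: str) -> List[str]:
    return ["chunk: Paris is the capital of France."]


def toy_graph_rag(query: str) -> List[str]:
    return ["triple: (Paris, capital_of, France)",
            "chunk: Paris is the capital of France."]


def toy_needs_multi_hop(query: str) -> bool:
    # Placeholder heuristic only; a real router would be a trained classifier
    # or an LLM prompt.
    return " and " in query.lower()
```

Under these assumptions, Selection calls one retriever per query, while Integration calls both and passes a longer context to the generator, which is one plausible reading of the cost and latency trade-off mentioned above.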

Abstract

Retrieval-Augmented Generation (RAG) enhances the performance of LLMs across various tasks by retrieving relevant information from external sources, particularly on text-based data. For structured data, such as knowledge graphs, GraphRAG has been widely used to retrieve relevant information. However, recent studies have revealed that structuring implicit knowledge from text into graphs can benefit certain tasks, extending the application of GraphRAG from graph data to general text-based data. Despite their successful extensions, most applications of GraphRAG for text data have been designed for specific tasks and datasets, lacking a systematic evaluation and comparison between RAG and GraphRAG on widely used text-based benchmarks. In this paper, we systematically evaluate RAG and GraphRAG on well-established benchmark tasks, such as Question Answering and Query-based Summarization. Our results highlight the distinct strengths of RAG and GraphRAG across different tasks and evaluation perspectives. Inspired by these observations, we investigate strategies to integrate their strengths to improve downstream tasks. Additionally, we provide an in-depth discussion of the shortcomings of current GraphRAG approaches and outline directions for future research.

Paper Structure

This paper contains 32 sections, 9 figures, 33 tables.

Figures (9)

  • Figure 1: The illustration of RAG, KG-based GraphRAGs and Community-based GraphRAGs.
  • Figure 2: Confusion matrices comparing GraphRAG and RAG correctness across datasets using Llama 3.1-8B.
  • Figure 3: Overall QA performance comparison of different methods.
  • Figure 4: Comparison of LLM-as-a-Judge evaluations for RAG and GraphRAG. "Local" refers to the evaluation of RAG vs. GraphRAG-Local, while "Global" refers to RAG vs. GraphRAG-Global.
  • Figure 5: Case 1 from Hotpot dataset.
  • ...and 4 more figures