SimGRAG: Leveraging Similar Subgraphs for Knowledge Graphs Driven Retrieval-Augmented Generation

Yuzheng Cai, Zhenyue Guo, Yiwen Pei, Wanrui Bian, Weiguo Zheng

TL;DR

SimGRAG tackles the challenge of grounding LLM outputs with large knowledge graphs by decoupling query–KG alignment into two stages: Query-to-Pattern alignment and Pattern-to-Subgraph alignment. It introduces Graph Semantic Distance (GSD) to quantify how well a pattern graph matches a candidate subgraph, and it employs an optimized top-$k$ retrieval algorithm to fetch semantically aligned subgraphs from massive KGs in under a second. By verbalizing retrieved subgraphs and using few-shot prompts, SimGRAG achieves strong performance on KGQA and fact verification without requiring oracle entities or KG-specific training. The method demonstrates robustness across multiple LLMs and datasets, offering a scalable, plug-and-play approach for KG-driven RAG with practical retrieval latency and high contextual conciseness for LLM reasoning.

Abstract

Recent advancements in large language models (LLMs) have shown impressive versatility across various tasks. To mitigate their hallucinations, retrieval-augmented generation (RAG) has emerged as a powerful approach, leveraging external knowledge sources such as knowledge graphs (KGs). In this paper, we study the task of KG-driven RAG and propose a novel Similar Graph Enhanced Retrieval-Augmented Generation (SimGRAG) method. It addresses the challenge of aligning query texts with KG structures through a two-stage process: (1) query-to-pattern, which uses an LLM to transform a query into a desired graph pattern, and (2) pattern-to-subgraph, which quantifies the alignment between the pattern and candidate subgraphs using a graph semantic distance (GSD) metric. We also develop an optimized retrieval algorithm that identifies the top-k subgraphs within 1 second on a 10-million-scale KG. Extensive experiments show that SimGRAG outperforms state-of-the-art KG-driven RAG methods in both question answering and fact verification. Our code is available at https://github.com/YZ-Cai/SimGRAG.
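
To make the pattern-to-subgraph stage concrete, the following is a minimal sketch of ranking candidate subgraphs by a GSD-style score. It assumes that GSD is the sum of L2 distances between the embeddings of matched nodes and matched relations, which is consistent with the "semantic L2 distance" mentioned in the Figure 4 caption but is not spelled out on this page. The function names (l2, gsd, top_k), the index-aligned triple matching, and the hand-written 2-D vectors in EMB are all hypothetical illustrations, not the authors' implementation, and the brute-force sort stands in for the optimized retrieval algorithm.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def gsd(pattern, subgraph, emb):
    """Assumed GSD: summed L2 distance over matched nodes and relations.

    `pattern` and `subgraph` are equal-length lists of (head, relation, tail)
    triples; triple i of the pattern is matched to triple i of the subgraph.
    """
    if len(pattern) != len(subgraph):
        raise ValueError("matched graphs must have the same number of edges")
    node_map = {}
    total = 0.0
    for (ph, pr, pt), (sh, sr, st) in zip(pattern, subgraph):
        total += l2(emb[pr], emb[sr])  # relation term
        for p_node, s_node in ((ph, sh), (pt, st)):
            # a pattern node must always map to the same subgraph node
            if node_map.setdefault(p_node, s_node) != s_node:
                raise ValueError("inconsistent node mapping")
    # distinct pattern nodes must map to distinct subgraph nodes
    if len(set(node_map.values())) != len(node_map):
        raise ValueError("node mapping is not one-to-one")
    # node term: each distinct pattern node is counted once
    total += sum(l2(emb[p], emb[s]) for p, s in node_map.items())
    return total

def top_k(pattern, candidates, emb, k):
    """Brute-force top-k: sort all candidates by ascending GSD."""
    return sorted(candidates, key=lambda s: gsd(pattern, s, emb))[:k]

# Hypothetical toy embeddings; a real system would use a text embedding model.
EMB = {
    "Tom Hanks": (1.0, 0.0),
    "Tom Hanks (actor)": (1.0, 0.1),
    "Tom Holland": (0.6, 0.4),
    "starred in": (0.0, 1.0),
    "starring": (0.0, 0.9),
    "directed": (0.5, 0.5),
    "Forrest Gump movie": (0.2, 0.2),
    "Forrest Gump": (0.2, 0.3),
}

# Pattern graph, as an LLM might produce it from the query text.
PATTERN = [("Tom Hanks", "starred in", "Forrest Gump movie")]
# Candidate subgraphs from the KG with the same structure as the pattern.
CAND_A = [("Tom Hanks (actor)", "starring", "Forrest Gump")]
CAND_B = [("Tom Holland", "directed", "Forrest Gump")]

if __name__ == "__main__":
    for cand in top_k(PATTERN, [CAND_B, CAND_A], EMB, k=2):
        print(round(gsd(PATTERN, cand, EMB), 3), cand)
```

In this toy setup the candidate whose entities and relation are semantically closest to the pattern ranks first. Exhaustively scoring every structurally matching subgraph would not scale to a 10-million-scale KG, which is why the paper pairs the metric with an optimized top-k retrieval algorithm.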

Paper Structure

This paper contains 50 sections, 6 equations, 5 figures, 19 tables, 1 algorithm.

Figures (5)

  • Figure 1: Ideal features for KG-driven RAG methods.
  • Figure 2: Comparison of mechanisms for aligning query text with KG structures. The example task is fact verification, where the query comes from the FactKG dataset with DBpedia as the KG.
  • Figure 3: Overview of the SimGRAG method.
  • Figure 4: Semantic L2 distance rankings of a given keyword against entities (relations) in DBpedia, computed using embeddings generated by the Nomic model.
  • Figure 5: Pareto optimal curves for retrieval.

Theorems & Definitions (4)

  • Definition 1: Graph Isomorphism
  • Definition 2: Graph Semantic Distance, GSD
  • Example 1
  • Example 2