Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs

Alberto Cattaneo, Carlo Luschi, Daniel Justus

TL;DR

This work tackles the reliability gap of KG-augmented LLMs by introducing SynthKGQA, a framework that generates large-scale KGQA datasets with ground-truth subgraphs and SPARQL targets from any KG. The authors apply SynthKGQA to Wikidata to produce GTSQA, a 32k-question dataset designed to probe zero-shot generalization to unseen graph structures and relation types, and use it to benchmark LLM-only and KG-RAG models. They show that using ground-truth subgraphs as supervision signals for training KG retrievers yields substantial gains, especially for multi-hop questions, and that conventional shortest-path supervision is often inadequate. Overall, the work advances fair benchmarking and training signal quality for KG-RAG systems, contributing to more trustworthy LLM-based reasoning over graphs.

Abstract

Retrieval of information from graph-structured knowledge bases represents a promising direction for improving the factuality of LLMs. While various solutions have been proposed, a comparison of methods is difficult due to the lack of challenging QA datasets with ground-truth targets for graph retrieval. We present SynthKGQA, an LLM-powered framework for generating high-quality Knowledge Graph Question Answering datasets from any Knowledge Graph, providing the full set of ground-truth facts in the KG to reason over questions. We show how, in addition to enabling more informative benchmarking of KG retrievers, the data produced with SynthKGQA also allows us to train better models. We apply SynthKGQA to Wikidata to generate GTSQA, a new dataset designed to test zero-shot generalization abilities of KG retrievers with respect to unseen graph structures and relation types, and benchmark popular solutions for KG-augmented LLMs on it.

Paper Structure

This paper contains 36 sections, 16 figures, 6 tables, 1 algorithm.

Figures (16)

  • Figure 1: The steps performed by SynthKGQA to generate and validate questions and ground-truth subgraphs from an arbitrary Knowledge Graph.
  • Figure 2: Hits (EM) of KG-RAG models on different graph isomorphism types, compared to the baseline (GPT-4o-mini, no RAG). Isomorphism types are grouped by the maximum number of hops; inside each subplot, moving from left to right corresponds to an increase in the total number of edges in the ground-truth answer subgraph.
  • Figure 3: Generalization abilities of trainable KG-RAG models. We measure EM performance in terms of difference with the EM of the baseline (GPT-4o-mini, no RAG).
  • Figure 4: Correlation between the percentage of ground-truth (GT) triples contained in the set of shortest paths (SP) from seed to answer nodes, and EM Hits of GPT-4o-mini when augmented with all SP triples. We display the Pearson correlation coefficient and associated p-value; each dot represents a different isomorphism type of ground-truth answer subgraph in the test set of GTSQA: they are clearly clustered based on the (maximum) number of hops.
  • Figure B1: Frequency of relation types of edges in the ground-truth answer subgraphs of questions in GTSQA. The 168 least-occurring relation types (tail of the distribution) are reserved for questions in the test set, to test zero-shot generalization abilities of KG retriever models.
  • ...and 11 more figures
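
The quantity on the horizontal axis of Figure 4, the percentage of ground-truth triples contained in the set of shortest paths from seed to answer nodes, can be illustrated with a small sketch. This is not the authors' code: the function names, the toy triples, the restriction to a single seed and a single answer node, and the choice to traverse edges in both directions are all assumptions made for illustration.

```python
from collections import deque

# A triple is (head, relation, tail). Entity and relation names are hypothetical.
Triple = tuple[str, str, str]


def bfs_distances(adj: dict[str, set[str]], source: str) -> dict[str, int]:
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def shortest_path_triples(triples: list[Triple], seed: str, answer: str) -> set[Triple]:
    """Triples lying on at least one shortest seed-to-answer path.

    Assumption: edge direction is ignored when searching for paths.
    """
    adj: dict[str, set[str]] = {}
    for head, _, tail in triples:
        adj.setdefault(head, set()).add(tail)
        adj.setdefault(tail, set()).add(head)

    from_seed = bfs_distances(adj, seed)
    from_answer = bfs_distances(adj, answer)
    if answer not in from_seed:
        return set()  # seed and answer are disconnected

    shortest = from_seed[answer]
    on_path: set[Triple] = set()
    for head, rel, tail in triples:
        # An edge (u, v) is on a shortest path iff d(seed,u) + 1 + d(v,answer) is minimal.
        for u, v in ((head, tail), (tail, head)):
            if u in from_seed and v in from_answer:
                if from_seed[u] + 1 + from_answer[v] == shortest:
                    on_path.add((head, rel, tail))
    return on_path


def gt_coverage(ground_truth: set[Triple], sp_triples: set[Triple]) -> float:
    """Fraction of ground-truth triples that also appear in the shortest-path set."""
    if not ground_truth:
        return 0.0
    return len(ground_truth & sp_triples) / len(ground_truth)


# Toy question: "Which film directed by Director_X received Award_Y?"
kg = [
    ("Film_A", "directed_by", "Director_X"),
    ("Film_A", "award_received", "Award_Y"),
    ("Film_B", "directed_by", "Director_X"),
]
ground_truth = {
    ("Film_A", "directed_by", "Director_X"),
    ("Film_A", "award_received", "Award_Y"),
}
sp = shortest_path_triples(kg, seed="Director_X", answer="Film_A")
coverage = gt_coverage(ground_truth, sp)
```

In this toy case the only shortest path from the seed to the answer is the single `directed_by` edge, so the shortest-path set misses the `award_received` triple that is needed to single out `Film_A` over `Film_B`, and the coverage is 0.5. It is a small instance of how shortest paths can omit facts the question depends on.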