A Human-in-the-Loop, LLM-Centered Architecture for Knowledge-Graph Question Answering

Larissa Pusch, Alexandre Courtiol, Tim Conrad

TL;DR

This work tackles the reliability of LLM-based QA in knowledge-heavy contexts by introducing a human-in-the-loop framework that translates natural language into auditable Cypher queries over Knowledge Graphs, with built-in explanations and iterative amendments. The approach couples a Generator, Executor, Explainer, and Amender to create transparent, controllable graph queries that non-experts can refine through natural language feedback. Across a synthetic Movie KG and two real KGs (MaRDI and Hyena), the authors quantify explanation quality, fault detection, and amendment efficiency, revealing substantial model-to-model variation and identifying temporal information as a common error source. The findings suggest that interactive, explainable KG QA can improve accuracy and trust, while also highlighting domain-dependent performance gaps that motivate future improvements in prompts, UI tooling, and cross-domain benchmarks.

Abstract

Large Language Models (LLMs) excel at language understanding but remain limited in knowledge-intensive domains due to hallucinations, outdated information, and limited explainability. Text-based retrieval-augmented generation (RAG) helps ground model outputs in external sources but struggles with multi-hop reasoning. Knowledge Graphs (KGs), in contrast, support precise, explainable querying, yet require knowledge of query languages. This work introduces an interactive framework in which LLMs generate and explain Cypher graph queries and users iteratively refine them through natural language. Applied to real-world KGs, the framework improves accessibility to complex datasets while preserving factual accuracy and semantic rigor, and it provides insight into how model performance varies across domains. Our core quantitative evaluation is a 90-query benchmark on a synthetic movie KG that measures query explanation quality and fault detection across multiple LLMs, complemented by two smaller real-life query-generation experiments on a Hyena KG and the MaRDI (Mathematical Research Data Initiative) KG.
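
The generate, execute, explain, and amend cycle described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, signatures, the `Turn` record, and the stub modules are all hypothetical, and the LLM and graph-database calls are replaced by placeholders. The node labels and the `DIRECTED` relationship come from the Movie KG schema; the property names `name` and `title` are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    """One round of the loop: the query, its results, and its explanation."""
    query: str
    rows: list
    explanation: str


def run_session(
    question: str,
    schema: str,
    generate: Callable[[str, str], str],    # (question, schema) -> Cypher
    execute: Callable[[str], list],         # Cypher -> result rows
    explain: Callable[[str, str], str],     # (Cypher, schema) -> explanation
    amend: Callable[[str, str, str], str],  # (Cypher, feedback, schema) -> Cypher
    feedback_rounds: List[str],
) -> List[Turn]:
    """Generate a query, then amend it once per piece of user feedback."""
    query = generate(question, schema)
    history = [Turn(query, execute(query), explain(query, schema))]
    for feedback in feedback_rounds:
        query = amend(query, feedback, schema)
        history.append(Turn(query, execute(query), explain(query, schema)))
    return history


# Placeholder modules (assumptions): a real system would call an LLM
# for generate/explain/amend and a graph database for execute.
def fake_generate(question: str, schema: str) -> str:
    return "MATCH (p:Person)-[:DIRECTED]->(m:Movie) RETURN p.name, m.title"


def fake_execute(query: str) -> list:
    return []  # no database attached in this sketch


def fake_explain(query: str, schema: str) -> str:
    return f"This query runs: {query}"


def fake_amend(query: str, feedback: str, schema: str) -> str:
    return query + " LIMIT 5"  # pretend the feedback asked for fewer rows
```

Calling `run_session("Who directed which movie?", "Person, Movie", fake_generate, fake_execute, fake_explain, fake_amend, ["Only show five results"])` returns one `Turn` for the initial query and one for the amended query, so the user can audit every intermediate version alongside its explanation.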

Paper Structure (44 sections, 11 figures, 15 tables)

Figures (11)

  • Figure 1: Architecture diagram. Purple nodes are contributed by the user, orange nodes are modules, the blue node is the graph schema, and the green nodes are pipeline outputs.
  • Figure 2: Schema of the synthetic Movie Knowledge Graph created as the basis for the queries in the benchmark dataset. The node types are Person, Movie, Critic, and City; the relationships are DIRECTED, ACTED_IN, HAS_FAVORITE, and BIRTH_CITY.
  • Figure 3: Effects on explanation accuracy. Left column: year mismatches counted as errors (strict criterion). Right column: year mismatches not counted as errors (relaxed criterion). These figures were constructed using an additive GLM. If items share a letter, they are not significantly different from each other.
  • Figure 4: How correctness of one-sentence summaries is influenced by query features.
  • Figure 5: For each perturbation category (defined in the paper's paragraph on injected errors), the bars show the fraction of queries in which each evaluated LLM correctly signalled that something was wrong.
  • ...and 6 more figures