A Human-in-the-Loop, LLM-Centered Architecture for Knowledge-Graph Question Answering
Larissa Pusch, Alexandre Courtiol, Tim Conrad
TL;DR
This work tackles the reliability of LLM-based QA in knowledge-heavy contexts by introducing a human-in-the-loop framework that translates natural language into auditable Cypher queries over Knowledge Graphs, with built-in explanations and iterative amendments. The approach couples a Generator, Executor, Explainer, and Amender to create transparent, controllable graph queries that non-experts can refine through natural language feedback. Across a synthetic Movie KG and two real KGs (MaRDI and Hyena), the authors quantify explanation quality, fault detection, and amendment efficiency, revealing substantial model-to-model variation and identifying temporal information as a common error source. The findings suggest that interactive, explainable KG QA can improve accuracy and trust, while also highlighting domain-dependent performance gaps that motivate future improvements in prompts, UI tooling, and cross-domain benchmarks.
Abstract
Large Language Models (LLMs) excel at language understanding but remain limited in knowledge-intensive domains due to hallucinations, outdated information, and limited explainability. Text-based retrieval-augmented generation (RAG) helps ground model outputs in external sources but struggles with multi-hop reasoning. Knowledge Graphs (KGs), in contrast, support precise, explainable querying, yet require knowledge of a query language. This work introduces an interactive framework in which LLMs generate and explain Cypher graph queries and users iteratively refine them through natural language. Applied to real-world KGs, the framework makes complex datasets more accessible while preserving factual accuracy and semantic rigor, and it provides insight into how model performance varies across domains. Our core quantitative evaluation is a 90-query benchmark on a synthetic movie KG that measures query explanation quality and fault detection across multiple LLMs, complemented by two smaller real-world query-generation experiments on a Hyena KG and the MaRDI (Mathematical Research Data Initiative) KG.
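To make the interaction loop concrete, the following is a minimal Python sketch of how the four components named in the TL;DR (Generator, Executor, Explainer, Amender) could be wired together. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `Turn` record, the prompt wording, and the `max_rounds` limit are all hypothetical, the LLM is modelled as any text-to-text callable, and the graph backend as any callable that takes a Cypher string and returns rows.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Assumption: an LLM is any function from prompt text to response text.
LLM = Callable[[str], str]
# Assumption: the graph backend is any function from a Cypher string to result rows.
GraphRunner = Callable[[str], List[dict]]


@dataclass
class Turn:
    """One round of the loop: the query, its results, and its explanation."""
    query: str
    rows: List[dict]
    explanation: str


def generate(llm: LLM, question: str, schema: str) -> str:
    # Generator: natural-language question -> Cypher query (hypothetical prompt).
    return llm(f"Schema:\n{schema}\nWrite a Cypher query answering: {question}").strip()


def explain(llm: LLM, query: str) -> str:
    # Explainer: Cypher query -> plain-language description (hypothetical prompt).
    return llm(f"Explain in plain language what this Cypher query does:\n{query}").strip()


def amend(llm: LLM, query: str, feedback: str) -> str:
    # Amender: current query + user feedback -> revised query (hypothetical prompt).
    return llm(f"Revise this Cypher query:\n{query}\nUser feedback: {feedback}").strip()


def run_session(
    llm: LLM,
    run_cypher: GraphRunner,
    question: str,
    schema: str,
    get_feedback: Callable[[Turn], Optional[str]],
    max_rounds: int = 3,
) -> List[Turn]:
    """Generate a query, then execute, explain, and amend until the user accepts."""
    history: List[Turn] = []
    query = generate(llm, question, schema)
    for round_no in range(max_rounds + 1):
        rows = run_cypher(query)  # Executor: run the query against the KG
        turn = Turn(query, rows, explain(llm, query))
        history.append(turn)
        if round_no == max_rounds:
            break  # amendment budget exhausted
        feedback = get_feedback(turn)  # human in the loop
        if not feedback:
            break  # user accepts the query and its explanation
        query = amend(llm, query, feedback)
    return history
```

In this sketch every query that is executed is also explained and recorded, so the returned history doubles as an audit trail of what was run and why it changed. In practice `run_cypher` would wrap a graph database client and `get_feedback` would be backed by a user interface; both are left as parameters here so that the loop itself stays independent of any particular backend.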
