What Breaks Knowledge Graph based RAG? Empirical Insights into Reasoning under Incomplete Knowledge
Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Hongkuan Zhou, Yuan He, Jiaoyan Chen, Steffen Staab, Evgeny Kharlamov
TL;DR
This paper tackles the problem of evaluating KG-RAG systems under knowledge incompleteness, a condition that real-world knowledge graphs routinely exhibit. It introduces a general benchmark construction pipeline that mines high-confidence Horn rules with AMIE3, removes inferable triples, and generates natural-language questions whose answers require reasoning over alternative KG paths. An evaluation protocol standardizes metrics (e.g., Hits@Any, F1, Hits@Hard) and postprocessing to ensure fair cross-study comparisons. Experiments across Family, FB15k-237, and Wikidata5m reveal that current KG-RAG models struggle when direct evidence is missing and often rely on textual labels and memorized surface forms. The findings point to directions for future work, including retrieval strategies for alternative paths, robust reasoning modules that generalize beyond specific relation patterns, and evaluation practices that align with real-world incompleteness.
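The "remove inferable triples" step summarized above can be illustrated with a minimal sketch. It assumes a single hand-written composition rule rather than rules mined by AMIE3, and the names `Triple`, `infer_by_composition`, and `remove_inferable`, as well as the toy family facts, are hypothetical and not taken from the paper.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def infer_by_composition(kg, body1, body2, head_rel):
    """Apply the Horn rule body1(X, Y) AND body2(Y, Z) => head_rel(X, Z).

    Returns a dict mapping each inferable triple to the list of
    two-hop paths (pairs of triples) that support it.
    """
    # Index the second body atom by its head entity for the join on Y.
    second_by_head = {}
    for t in kg:
        if t.relation == body2:
            second_by_head.setdefault(t.head, []).append(t)

    inferred = {}
    for t1 in kg:
        if t1.relation != body1:
            continue
        for t2 in second_by_head.get(t1.tail, []):
            target = Triple(t1.head, head_rel, t2.tail)
            inferred.setdefault(target, []).append((t1, t2))
    return inferred


def remove_inferable(kg, inferred):
    """Drop triples that are present in the KG and derivable via the rule.

    Assumption: supporting paths use relations other than the removed one,
    so removal never destroys the alternative path. A full pipeline would
    have to check this explicitly.
    """
    removable = {t for t in inferred if t in kg}
    return kg - removable, removable


# Toy KG (hypothetical facts, for illustration only).
kg = {
    Triple("anna", "motherOf", "ben"),
    Triple("ben", "fatherOf", "cara"),
    Triple("anna", "grandmotherOf", "cara"),
    Triple("anna", "motherOf", "dan"),
}

inferred = infer_by_composition(kg, "motherOf", "fatherOf", "grandmotherOf")
incomplete_kg, removed = remove_inferable(kg, inferred)
```

After removal, a question about the grandmother of `cara` can no longer be answered by a direct lookup in `incomplete_kg`; it can only be answered by following the `motherOf` and `fatherOf` path, which is the situation the benchmark is designed to create.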
Abstract
Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) is an increasingly explored approach for combining the reasoning capabilities of large language models with the structured evidence of knowledge graphs. However, current evaluation practices fall short: existing benchmarks often include questions that can be answered directly from triples already present in the KG, making it unclear whether models perform reasoning or simply retrieve answers. Moreover, inconsistent evaluation metrics and lenient answer-matching criteria further obscure meaningful comparisons. In this work, we introduce a general method for constructing benchmarks, together with an evaluation protocol, to systematically assess KG-RAG methods under knowledge incompleteness. Our empirical results show that current KG-RAG methods have limited reasoning ability under missing knowledge, often rely on internal memorization, and exhibit varying degrees of generalization depending on their design.
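To make the concern about metrics and answer matching concrete, the following is a rough sketch of set-based scoring with strict matching. The abstract does not define the metrics, so the definitions here are assumptions: Hits@Any is taken as "at least one gold answer is predicted", F1 as set-level F1, and Hits@Hard as "at least one answer whose direct triple was removed is predicted". The `normalize` function (lowercasing and whitespace collapsing, then exact match) is likewise a hypothetical matching rule, not the paper's protocol.

```python
def normalize(answer: str) -> str:
    # Assumed matching rule: case-insensitive exact match after
    # collapsing whitespace. No substring or fuzzy matching.
    return " ".join(answer.lower().split())


def _as_set(answers):
    return {normalize(a) for a in answers}


def hits_at_any(predicted, gold) -> float:
    """1.0 if at least one gold answer appears among the predictions."""
    return float(bool(_as_set(predicted) & _as_set(gold)))


def f1_score(predicted, gold) -> float:
    """Set-based F1 between predicted and gold answers."""
    p, g = _as_set(predicted), _as_set(gold)
    true_positives = len(p & g)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(p)
    recall = true_positives / len(g)
    return 2 * precision * recall / (precision + recall)


def hits_at_hard(predicted, hard_gold) -> float:
    """1.0 if at least one 'hard' answer is predicted.

    Assumption: a hard answer is one whose direct supporting triple
    was removed from the KG, so it requires an alternative path.
    """
    return float(bool(_as_set(predicted) & _as_set(hard_gold)))


# Hypothetical example: one easy answer is found, the hard one is missed.
predicted = ["Anna", "eve"]
gold = ["anna", "maria"]
hard_gold = ["maria"]

scores = {
    "hits_at_any": hits_at_any(predicted, gold),
    "f1": f1_score(predicted, gold),
    "hits_at_hard": hits_at_hard(predicted, hard_gold),
}
```

Under these assumed definitions, a model can score well on Hits@Any while scoring zero on Hits@Hard, which is why reporting only a lenient metric can hide failures on the answers that actually require reasoning.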
