KG-LLM-Bench: A Scalable Benchmark for Evaluating LLM Reasoning on Textualized Knowledge Graphs
Elan Markowitz, Krupa Galiya, Greg Ver Steeg, Aram Galstyan
TL;DR
KG-LLM-Bench is a scalable, extensible benchmark for evaluating how LLMs reason over textualized knowledge graphs across five tasks. It evaluates seven LLMs across five KG textualization formats and under a pseudonymization regime, showing that encoding choices significantly affect both performance and token efficiency. The framework combines subgraph sampling, deterministic question generation, and exact-match scoring. The results offer practical guidance for designing knowledge-augmented LLM systems and point to future research on scalable KG reasoning and test-time inference.
Abstract
Knowledge graphs have emerged as a popular method for injecting up-to-date, factual knowledge into large language models (LLMs). This is typically achieved by converting the knowledge graph into text that the LLM can process in context. While multiple methods of encoding knowledge graphs have been proposed, the impact of this textualization process on LLM performance remains under-explored. We introduce KG-LLM-Bench, a comprehensive and extensible benchmark spanning five knowledge graph understanding tasks, and evaluate how different encoding strategies affect performance across various base models. Our extensive experiments with seven language models and five textualization strategies provide insights for optimizing LLM performance on KG reasoning tasks.
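To make the notion of textualization concrete, the following is a minimal sketch of how one small subgraph could be rendered as text in two different ways and how an answer could be scored by exact match. The toy triples, the two formats (a pipe-separated edge list and a JSON grouping), and the function names are illustrative assumptions, not the formats or code used by the benchmark; the normalization in `exact_match` is likewise an assumption.

```python
import json

# Hypothetical toy subgraph as (head, relation, tail) triples.
# The relation names are made up for illustration.
TRIPLES = [
    ("Ada Lovelace", "born_in", "London"),
    ("London", "capital_of", "United Kingdom"),
    ("Ada Lovelace", "field", "Mathematics"),
]


def to_edge_list(triples):
    """Textualize the graph as one 'head | relation | tail' line per edge."""
    return "\n".join(f"{h} | {r} | {t}" for h, r, t in triples)


def to_json(triples):
    """Textualize the graph as JSON, grouping edges by head entity."""
    grouped = {}
    for h, r, t in triples:
        grouped.setdefault(h, {}).setdefault(r, []).append(t)
    return json.dumps(grouped, indent=2)


def exact_match(prediction, gold):
    """Return 1 if the answers are equal after trimming and case-folding.

    The normalization rule here is an assumption made for this sketch.
    """
    return int(prediction.strip().casefold() == gold.strip().casefold())


if __name__ == "__main__":
    for name, encode in [("edge list", to_edge_list), ("json", to_json)]:
        text = encode(TRIPLES)
        # Character count is only a rough stand-in for token cost.
        print(f"--- {name} ({len(text)} chars) ---")
        print(text)
    print(exact_match(" london ", "London"))
```

Both encodings carry the same facts, yet they differ in length and layout, which is the kind of variation whose effect on model accuracy and token efficiency the benchmark is designed to measure.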
