KG-LLM-Bench: A Scalable Benchmark for Evaluating LLM Reasoning on Textualized Knowledge Graphs

Elan Markowitz, Krupa Galiya, Greg Ver Steeg, Aram Galstyan

TL;DR

KG-LLM-Bench introduces a scalable, extensible benchmark to evaluate how LLMs reason over textualized knowledge graphs across five tasks. It systematically compares five KG textualization formats, seven LLMs, and a pseudonymization regime, revealing that encoding choices significantly influence performance and token efficiency. The framework combines subgraph sampling, deterministic question generation, and exact-match scoring to provide actionable insights into how to optimize KG reasoning in practice. The results offer practical guidance for designing knowledge-augmented LLM systems and highlight directions for future research in scalable KG reasoning and test-time inference.

Abstract

Knowledge graphs have emerged as a popular method for injecting up-to-date, factual knowledge into large language models (LLMs). This is typically achieved by converting the knowledge graph into text that the LLM can process in context. While multiple methods of encoding knowledge graphs have been proposed, the impact of this textualization process on LLM performance remains under-explored. We introduce KG-LLM-Bench, a comprehensive and extensible benchmark spanning five knowledge graph understanding tasks, and evaluate how different encoding strategies affect performance across various base models. Our extensive experiments with seven language models and five textualization strategies provide insights for optimizing LLM performance on KG reasoning tasks.
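
To make the idea of textualization concrete, the following is a minimal sketch of rendering the same small graph in two different text formats, plus an exact-match scorer of the kind the TL;DR mentions. The toy triples, the function names (`to_edge_list`, `to_json`, `exact_match`), the two formats, and the lowercase/whitespace normalization are all illustrative assumptions; they are not taken from the paper and may differ from the five formats and the scoring rule the benchmark actually uses.

```python
import json

# Hypothetical toy knowledge graph as (head, relation, tail) triples.
TRIPLES = [
    ("France", "capital", "Paris"),
    ("France", "borders", "Spain"),
    ("Spain", "capital", "Madrid"),
]


def to_edge_list(triples):
    """Render each triple on its own line as '(head, relation, tail)'."""
    return "\n".join(f"({h}, {r}, {t})" for h, r, t in triples)


def to_json(triples):
    """Group outgoing edges by head entity, then serialize as JSON."""
    grouped = {}
    for h, r, t in triples:
        grouped.setdefault(h, {}).setdefault(r, []).append(t)
    return json.dumps(grouped, indent=2)


def exact_match(prediction, answer):
    """Return 1 if the strings match after trimming and lowercasing, else 0.

    The normalization here is an assumption made for this sketch.
    """
    return int(prediction.strip().lower() == answer.strip().lower())


if __name__ == "__main__":
    print(to_edge_list(TRIPLES))
    print(to_json(TRIPLES))
    print(exact_match(" paris ", "Paris"))
```

Both functions encode identical facts, yet the JSON rendering is noticeably longer than the edge list for this graph. Character count is only a crude proxy for token count, but it illustrates why the choice of encoding can affect token efficiency as well as accuracy.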

Paper Structure

This paper contains 49 sections, 9 equations, 12 figures, 4 tables, and 1 algorithm.

Figures (12)

  • Figure 1: Different formats for graph textualization can result in highly varied performance on downstream tasks.
  • Figure 2: Framework for KG-LLM-Bench.
  • Figure 3: Heatmaps of the performance of various models. Each heatmap shows tasks as rows and textualization functions as columns. (Top) Heatmap colors on a global scale of [0.0, 1.0]. (Bottom) Heatmap colors normalized per task over [task minimum, task maximum]. Tasks are ordered from easiest to hardest overall, and textualization functions from best to worst performing overall. Additional models are in the appendix.
  • Figure 4: Performance analysis across tasks: (a) comparison of textualization strategies and (b) performance by model. The metric for (a) shows absolute difference in accuracy for each strategy compared to the mean for that task. The mean is shown as the dashed circle.
  • Figure 5: Impact of pseudonymization by task. Higher means that the model did better with pseudonymization. Each color represents a different model.
  • ...and 7 more figures