Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning

Vasilije Markovic, Lazar Obradovic, Laszlo Hajdu, Jovan Pavlovic

TL;DR

This work addresses hyperparameter sensitivity in graph-augmented retrieval-augmented generation by optimizing Cognee across ingestion, graph construction, retrieval, and prompting. It demonstrates that targeted tuning yields consistent performance gains on three multi-hop QA benchmarks, assessed via exact match (EM), F1, and DeepEval's correctness metric, though improvements vary by task and metric. The study also reveals evaluation and generalization challenges, underscoring the need for clearer optimization frameworks in modular KG-LLM systems. Overall, it highlights the practical impact of systematic configuration tuning and sets directions for broader, multi-objective optimization and standardized benchmarks in graph-based RAG pipelines.

Abstract

Integrating Large Language Models (LLMs) with Knowledge Graphs (KGs) results in complex systems with numerous hyperparameters that directly affect performance. While such systems are increasingly common in retrieval-augmented generation, the role of systematic hyperparameter optimization remains underexplored. In this paper, we study this problem in the context of Cognee, a modular framework for end-to-end KG construction and retrieval. Using three multi-hop QA benchmarks (HotPotQA, TwoWikiMultiHop, and MuSiQue), we optimize parameters related to chunking, graph construction, retrieval, and prompting. Each configuration is scored using established metrics (exact match, F1, and DeepEval's LLM-based correctness metric). Our results demonstrate that meaningful gains can be achieved through targeted tuning. Although the gains are consistent, they are not uniform: performance varies across datasets and metrics. This variability highlights both the value of tuning and the limitations of standard evaluation measures. Beyond demonstrating the immediate potential of hyperparameter tuning, we argue that future progress will depend not only on architectural advances but also on clearer frameworks for optimization and evaluation in complex, modular systems.
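
To make the two string-based metrics concrete, the following is a minimal sketch of exact match and token-level F1 as they are commonly computed for extractive QA. The normalization steps (lowercasing, stripping punctuation and English articles) follow the widely used SQuAD-style convention and are an assumption here; the abstract does not specify the exact normalization, and the function names `normalize`, `exact_match`, and `token_f1` are illustrative only.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization, not necessarily the paper's.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a prediction of "Paris, France" against a gold answer of "Paris" scores 0 on exact match but about 0.67 on F1. This illustrates why the two metrics can diverge on the same configuration, and why a correct but more verbose answer may be penalized by both.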

Paper Structure

This paper contains 27 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Running maximum performance curves for MuSiQue, TwoWikiMultiHop, and HotPotQA.