R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi

TL;DR

R2-KG addresses the fragility and cost of single-agent KG reasoning by introducing a dual-agent framework that delegates KG exploration to a low-capacity Operator and final judgment to a high-capacity Supervisor, coupled with an Abstention mechanism to ensure trustworthy outputs. The approach is plug-and-play and KG/task-agnostic, enabling reliable reasoning across diverse benchmarks with reduced dependence on expensive LLMs. Empirical results across WebQSP, CWQ, MetaQA 3-hop, CRONQUESTIONS, and FactKG show superior or competitive accuracy and high reliability, including a 100% hit rate on MetaQA 3-hop, while markedly reducing high-capacity LLM usage. A single-agent, strict self-consistency variant provides additional cost savings at the expense of some coverage, illustrating a practical spectrum of reliability–cost trade-offs. The work thus offers a scalable, reliable KG reasoning solution with tangible operational benefits for real-world deployments.

Abstract

Recent studies have combined Large Language Models (LLMs) with Knowledge Graphs (KGs) to enhance reasoning, improving inference accuracy without additional training while mitigating hallucination. However, existing frameworks still suffer from two practical drawbacks: they must be re-tuned whenever the KG or reasoning task changes, and they depend on a single, high-capacity LLM for reliable (i.e., trustworthy) reasoning. To address these drawbacks, we introduce R2-KG, a plug-and-play, dual-agent framework that separates reasoning into two roles: an Operator (a low-capacity LLM) that gathers evidence and a Supervisor (a high-capacity LLM) that makes final judgments. This design keeps LLM inference cost-efficient while maintaining strong reasoning accuracy. Additionally, R2-KG employs an Abstention mechanism, generating answers only when sufficient evidence has been collected from the KG, which significantly enhances reliability. Experiments across five diverse benchmarks show that R2-KG consistently outperforms baselines in both accuracy and reliability, regardless of the inherent capability of the LLM used as the Operator. Further experiments reveal that the single-agent version of R2-KG, equipped with a strict self-consistency strategy, achieves significantly higher reliability than the baselines at reduced inference cost, but with an increased abstention rate on complex KGs. Our findings establish R2-KG as a flexible and cost-effective solution for KG-based reasoning, reducing reliance on high-capacity LLMs while ensuring trustworthy inference. The code is available at https://github.com/ekrxjwh2009/R2-KG/.
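The abstract does not spell out how the strict self-consistency strategy of the single-agent variant decides when to abstain. As a rough illustration only, the sketch below assumes one plausible reading: the agent is run several times and answers only if every run returns the same non-abstaining answer. The function name `strict_self_consistency`, the `run_once` callable, and the `n_trials` parameter are hypothetical and are not taken from the R2-KG code.

```python
from typing import Callable, Optional

ABSTAIN = None  # sentinel meaning "the agent declined to answer"


def strict_self_consistency(
    run_once: Callable[[str], Optional[str]],
    question: str,
    n_trials: int = 3,
) -> Optional[str]:
    """Answer only when all trials agree; otherwise abstain.

    Assumption: `run_once` performs one full reasoning attempt and
    returns an answer string, or ABSTAIN if evidence was insufficient.
    """
    answers = [run_once(question) for _ in range(n_trials)]
    # Any single abstention is treated as insufficient evidence overall.
    if any(a is ABSTAIN for a in answers):
        return ABSTAIN
    # "Strict" here means unanimous agreement, not a majority vote.
    if len(set(answers)) != 1:
        return ABSTAIN
    return answers[0]
```

Under this reading, a stricter agreement rule trades coverage for reliability, which is consistent with the increased abstention rate the abstract reports for complex KGs.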

Paper Structure

This paper contains 40 sections, 7 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: R2-KG: The two agents provide an 'Answer' only when they are confident enough to do so. If multiple exploration attempts fail to gather sufficient information, the framework determines that it does not know the answer and abstains.
  • Figure 2: R2-KG solves a multi-hop query through an iterative dialogue between a low-capacity Operator and a high-capacity Supervisor. The Operator gathers triples via GetRelation() and ExploreKG() calls, and all of the explored relations ($R_k$) and explored triples ($G_k$) are stacked in the Supervisor's Path Stacks at every step $k < T$, where $T$ is the iteration limit. If the Path Stacks lack sufficient evidence for verification, the Supervisor sends feedback to the Operator to pursue alternative paths or roll back to an earlier hop.
  • Figure 3: Changes in coverage, F1 Scores, and hit rate based on Iteration Limit
  • Figure 4: Successful case in WebQSP. The Supervisor effectively guides the model to extract a more relevant answer to the question. Each colored box corresponds to the Operator, the Server Response, or the Supervisor.
  • Figure 5: Failure case in WebQSP. The Supervisor fails to infer, leading the Operator to repeatedly invoke functions in the wrong format. Each colored box corresponds to the Operator, the Server Response, or the Supervisor.
  • ...and 4 more figures
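
The Figure 2 caption describes an iterative loop: the Operator issues GetRelation() and ExploreKG() calls, the results accumulate in the Supervisor's Path Stacks, and the framework abstains if no answer is reached within the iteration limit $T$. The following is a minimal sketch of that control flow, not the authors' implementation. The toy `KG`, the scripted stand-ins for both LLM agents, and all function names (`get_relations`, `explore_kg`, `run`, and so on) are assumptions made for illustration, and the rollback to an earlier hop is omitted.

```python
from typing import Callable, Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy KG: (head, relation) -> tails.
KG: Dict[Tuple[str, str], List[str]] = {
    ("Inception", "directed_by"): ["Christopher Nolan"],
    ("Christopher Nolan", "born_in"): ["London"],
}


def get_relations(entity: str) -> List[str]:
    """Stand-in for GetRelation(): relations leaving `entity`."""
    return sorted(r for (h, r) in KG if h == entity)


def explore_kg(entity: str, relation: str) -> List[Triple]:
    """Stand-in for ExploreKG(): triples for (entity, relation)."""
    return [(entity, relation, t) for t in KG.get((entity, relation), [])]


def scripted_operator(start: str, path: List[str]) -> Callable:
    """Replaces the low-capacity LLM with a fixed relation path."""
    def operator(triples: List[Triple], feedback: str) -> Tuple[str, str]:
        hop = len(triples)  # each hop yields one triple in this toy KG
        entity = triples[-1][2] if triples else start
        relation = path[hop] if hop < len(path) else path[-1]
        return entity, relation
    return operator


def scripted_supervisor(path: List[str]) -> Callable:
    """Replaces the high-capacity LLM: answers once the path is covered."""
    def supervisor(relation_stack, triple_stack) -> Tuple[Optional[str], str]:
        if [t[1] for t in triple_stack] == path:
            return triple_stack[-1][2], ""
        return None, "evidence incomplete; keep exploring"
    return supervisor


def run(operator: Callable, supervisor: Callable,
        iteration_limit: int = 5) -> Optional[str]:
    """Operator-Supervisor loop; returns None to signal abstention."""
    relation_stack: List[Tuple[str, List[str]]] = []  # explored relations
    triple_stack: List[Triple] = []                   # explored triples
    feedback = ""
    for _ in range(iteration_limit):
        entity, relation = operator(triple_stack, feedback)
        available = get_relations(entity)
        relation_stack.append((entity, available))
        if relation in available:
            triple_stack.extend(explore_kg(entity, relation))
        answer, feedback = supervisor(relation_stack, triple_stack)
        if answer is not None:
            return answer
    return None  # iteration limit reached without sufficient evidence


path = ["directed_by", "born_in"]
result = run(scripted_operator("Inception", path), scripted_supervisor(path))
print(result)
```

In this sketch, abstention is simply the fall-through case of the loop: if the requested relation never appears in the KG, or the iteration limit is too small for the number of hops, `run` returns `None` rather than guessing.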