Learning to Refine: An Agentic RL Approach for Iterative SPARQL Query Construction

Floris Vossebeld, Shenghui Wang

TL;DR

This work tackles multi-hop KGQA by reframing SPARQL construction as an iterative, agentic decision process. A compact LLM is fine-tuned with Group Relative Policy Optimization to learn a think–act–observe policy that improves query refinement through execution feedback, without supervised demonstrations. On a curated LC-QuAD 2.0 subset, the RL-tuned agent achieves strong gains in accuracy and query executability, and ablations show that deliberate reasoning provides a meaningful boost. The approach demonstrates how interaction with a symbolic knowledge graph can bridge probabilistic LLM reasoning and structured data, offering a generalizable blueprint for agentic, tool-using reasoning in KGQA and related symbolic tasks.

Abstract

Generating complex, logically sound SPARQL queries for multi-hop questions remains a critical bottleneck for Knowledge Graph Question Answering, as the brittle nature of one-shot generation by Large Language Models (LLMs) hinders reliable interaction with structured data. Current methods lack the adaptive policies needed to dynamically debug queries based on real-time execution feedback. This paper introduces a novel agentic framework where an LLM learns a resilient policy for the sequential process of iterative SPARQL construction. We show that a compact 3B-parameter model, trained exclusively via outcome-driven Reinforcement Learning (GRPO) without supervised fine-tuning, can learn effective policies for this task, discovering how to systematically recover from execution errors and refine its queries toward a correct answer. On a curated, executable single-answer subset of LC-QuAD 2.0, our agent achieves 49.7% accuracy post-entity-linking, a significant 17.5 percentage point improvement over the strongest iterative zero-shot baseline. Further analysis reveals that while the agent's capability is driven by RL, its performance is enhanced by an explicit deliberative reasoning step that acts as a cognitive scaffold to improve policy precision. This work presents a generalizable blueprint for teaching agents to master formal, symbolic tools through interaction, bridging the gap between probabilistic LLMs and the structured world of Knowledge Graphs.

Paper Structure

This paper contains 28 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The end-to-end agentic inference loop. The agent policy (LLM + QLoRA adapters) receives the current state (question and history) and generates reasoning (<think>) followed by an action: either a SPARQL query (<query>) or the final answer (<answer>). The SPARQL query is executed against the KG, with the outcome (<query_result>) updating the state for the next iteration. The loop terminates when the agent produces an answer.
  • Figure 2: The end-to-end RL fine-tuning cycle. A batch of questions is sampled, and for each, the agent (LLM + LoRA adapters) generates $G$ rollouts using the iterative inference loop. The composite reward $R(\tau)$ is computed for each trajectory. GRPO uses these rewards to calculate a policy gradient, which is then used to update the LoRA adapter weights.
  • Figure 3: Training dynamics of the RL-Tuned Agent over one epoch (40 training steps). The plots illustrate a layered learning process: the agent is optimized for reward (a), which drives improvements first in query syntax (c) and then in semantic accuracy (b). Simultaneously, the agent learns task efficiency, reducing the average number of turns required to find an answer (d).
  • Figure 4: Absolute counts of failure modes on the test set. Reinforcement learning dramatically reduces fundamental errors like execution failures and refusal to query. This shifts the primary challenge for the trained agents from generating valid syntax to formulating correct logical plans.
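
The inference loop described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the names `run_episode`, `toy_policy`, `toy_executor`, and `max_turns` are hypothetical, the policy and the SPARQL endpoint are replaced by toy stand-ins, and the way results and errors are serialized into `<query_result>` is an assumption. Only the tag names (`<think>`, `<query>`, `<answer>`, `<query_result>`) and the terminate-on-answer rule come from the caption.

```python
import re
from typing import Callable, List, Optional, Tuple

# Matches the first action tag in a model output: <query>...</query> or <answer>...</answer>.
ACTION = re.compile(r"<(query|answer)>(.*?)</\1>", re.DOTALL)


def run_episode(
    question: str,
    policy: Callable[[str], str],          # stand-in for the LLM: state text -> output text
    execute_sparql: Callable[[str], str],  # stand-in for the KG endpoint: query -> result text
    max_turns: int = 5,                    # assumed turn budget, not taken from the paper
) -> Tuple[Optional[str], List[str]]:
    """Run one think-act-observe episode; return (answer or None, history)."""
    history: List[str] = []
    for _ in range(max_turns):
        # State = question plus everything generated and observed so far.
        state = question + "\n" + "\n".join(history)
        output = policy(state)
        history.append(output)

        match = ACTION.search(output)
        if match is None:
            # No recognizable action: report that back as an observation.
            history.append("<query_result>error: no action tag found</query_result>")
            continue

        kind, body = match.group(1), match.group(2).strip()
        if kind == "answer":
            return body, history  # the loop terminates when an answer is produced

        # Otherwise execute the query; failures are fed back as observations.
        try:
            result = execute_sparql(body)
        except Exception as exc:
            result = f"error: {exc}"
        history.append(f"<query_result>{result}</query_result>")
    return None, history  # turn budget exhausted without an answer


def toy_policy(state: str) -> str:
    """Hypothetical policy: one broken query, one repaired query, then an answer."""
    if "<query_result>" not in state:
        return "<think>Try a first query.</think><query>SELEC ?x WHERE { wd:Q1 wdt:P1 ?x }</query>"
    last = state.split("<query_result>")[-1].split("</query_result>")[0]
    if last.startswith("error"):
        return "<think>Execution failed; fix the keyword.</think><query>SELECT ?x WHERE { wd:Q1 wdt:P1 ?x }</query>"
    return f"<think>The result looks usable.</think><answer>{last}</answer>"


def toy_executor(query: str) -> str:
    """Hypothetical executor: rejects anything that does not start with SELECT."""
    if not query.startswith("SELECT"):
        raise ValueError("malformed query")
    return "Q42"


if __name__ == "__main__":
    answer, history = run_episode("Toy question?", toy_policy, toy_executor)
    print(answer)
    print(len(history), "history entries")
```

In this toy run the first query fails, the error text is appended to the state, and the stand-in policy repairs the query on the next turn before answering. That is the recovery behavior the caption describes, here hard-coded rather than learned.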