KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA

Zhuo Chen, Fei Wang, Zixuan Li, Zhao Zhang, Weiwei Ding, Chuanguang Yang, Yongjun Xu, Xiaolong Jin, Jiafeng Guo

TL;DR

KnowCoder-A1 addresses the fragility and limited exploration of process-supervised agentic KBQA by learning from outcome-only supervision through a multi-stage curriculum RL framework. It bootstraps with a small, high-quality outcome-curated dataset and then progressively strengthens autonomous reasoning via GRPO with an easy-to-hard reward schedule. Empirical results across WebQSP, CWQ, and GrailQA show state-of-the-art performance in low-resource settings and strong zero-shot generalization, while maintaining efficiency. The work highlights robust recovery from errors and flexible reasoning trajectories as key advantages of outcome-driven exploration in agentic KBQA.
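
As a rough illustration of how outcome-only rewards can drive GRPO, the sketch below scores each rollout by the F1 of its final answer set against the gold answers and then normalizes rewards within the rollout group. The function names, the choice of plain F1 as the reward, and the use of the population standard deviation are assumptions made for illustration; the summary above does not specify the paper's exact reward definitions or schedule.

```python
from statistics import mean, pstdev


def f1_reward(predicted, gold):
    """Outcome-only reward (assumed): F1 between predicted and gold answer sets."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this detail
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for one question.
gold_answers = ["a", "b"]
rollout_answers = [["a", "b"], ["a"], ["c"], []]
rewards = [f1_reward(answers, gold_answers) for answers in rollout_answers]
advantages = group_relative_advantages(rewards)
```

No intermediate step is scored: a rollout is credited only through its final answers, so any trajectory that reaches the right answers is reinforced relative to the rest of its group.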

Abstract

Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate its corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning on KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision via multi-stage curriculum reinforcement learning (RL). To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.
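
The outcome-based rejection sampling step mentioned in the abstract can be pictured with a minimal sketch: sample several candidate trajectories per question and keep only those whose final answers match the gold answers, without inspecting the intermediate steps. The trajectory layout (a dict with `steps` and `answers`), the exact-match criterion, and the `keep_per_question` cap are hypothetical choices for illustration, not details taken from the paper.

```python
def answers_match(predicted, gold):
    """Outcome check (assumed): the predicted answer set equals the gold set."""
    return set(predicted) == set(gold)


def rejection_sample(trajectories, gold, keep_per_question=1):
    """Keep up to `keep_per_question` trajectories whose final answers are correct."""
    kept = []
    for trajectory in trajectories:
        if len(kept) >= keep_per_question:
            break
        if answers_match(trajectory["answers"], gold):
            kept.append(trajectory)
    return kept


# Hypothetical candidates sampled from a strong LLM for one question.
candidates = [
    {"steps": ["query_1"], "answers": ["x"]},
    {"steps": ["query_1", "query_2"], "answers": ["y", "z"]},
    {"steps": ["query_3"], "answers": ["z", "y"]},
]
selected = rejection_sample(candidates, gold=["y", "z"], keep_per_question=2)
```

Because the filter looks only at the outcome, the two kept trajectories may follow different query sequences, which is the kind of diversity a step-by-step process filter would tend to remove.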

Paper Structure

This paper contains 37 sections, 10 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: The key limitations of existing agentic approaches that rely on process supervision.
  • Figure 2: An overview of the training framework of KnowCoder-A1. Stage 1 (left): the SFT-based cold-start process, where high-quality trajectories are curated from strong LLMs to fine-tune an initial agent. Stage 2 (right): the multi-phase Reinforcement Learning curriculum, where the agent is progressively improved through exploration and a dynamic reward strategy.
  • Figure 3: Training curves for KnowCoder-A1, illustrating: (a) training reward, (b) response length, (c) interaction turns, and (d) the number of invalid tool calls.
  • Figure 4: Evolution of Robustness and Flexibility during training: (a) Robustness, shown by the composition of rollout trajectories, and (b) Flexibility, shown by the number of unique SPARQL queries per question.
  • Figure 5: Joint distribution of mean empty-result/error frequency and mean reward across training phases.
  • ...and 1 more figure