Toward Honest Language Models for Deductive Reasoning

Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan

TL;DR

The paper tackles honest deductive reasoning by separating answerability from knowledge, and introduces GraphLA and GraphLI to test whether models can derive conclusions only when entailed and abstain otherwise. It shows that standard prompting and common RL/SFT methods struggle, especially as problem depth grows. The authors propose Anchor, a ground-truth trajectory–injected reinforcement learning approach, which unifies supervised and reinforcement signals to stabilize training and promote honest abstention. Across two datasets and multiple model scales, Anchor outperforms baselines and synergizes with curriculum learning to achieve robust honest reasoning. This work highlights the critical role of training dynamics in enabling reliable, abstaining reasoning in language models and provides practical methods for more trustworthy reasoning systems.

Abstract

Deductive reasoning is the process of deriving conclusions strictly from the given premises, without relying on external knowledge. We define honesty in this setting as a model's ability to respond only when the conclusion is logically entailed by the premises, and to abstain otherwise. However, current language models often fail to reason honestly, producing unwarranted answers when the input is insufficient. To study this challenge, we formulate honest deductive reasoning as multi-step tasks where models must either derive the correct conclusion or abstain. We curate two datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that prompting and existing training methods, including GRPO with or without supervised fine-tuning initialization, struggle on these tasks. In particular, GRPO optimizes only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. To address this, we propose Anchor, a reinforcement learning method that injects ground-truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves overall reasoning performance, underscoring the importance of training dynamics for enabling honest deductive reasoning in language models.
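
To make the dataset construction concrete, here is a minimal sketch of how answerable and unanswerable logical-inference instances could be generated from a graph. It is not the paper's generator: the function names (`make_instance`, `is_entailed`, `honest_answer`, `make_dataset`), the chain-shaped graph, and the specific perturbation (redirecting one edge to a fresh node) are all assumptions chosen for illustration.

```python
import random


def make_instance(k, rng, perturb):
    """Build a chain v0 -> v1 -> ... -> vk of depth k (k >= 1).

    If `perturb` is True, one randomly chosen edge is redirected to a fresh
    node (a hypothetical perturbation), so vk is no longer derivable from v0.
    """
    nodes = [f"v{i}" for i in range(k + 1)]
    edges = [(nodes[i], nodes[i + 1]) for i in range(k)]
    if perturb:
        i = rng.randrange(k)
        src, _ = edges[i]
        edges[i] = (src, f"u{i}")  # break the chain at position i
    rng.shuffle(edges)  # premises are presented in random order
    return {
        "premises": edges,
        "query": (nodes[0], nodes[-1]),
        "answerable": not perturb,
    }


def is_entailed(premises, query):
    """Return True if the query target is reachable from the query source."""
    src, dst = query
    adjacency = {}
    for a, b in premises:
        adjacency.setdefault(a, []).append(b)
    seen, stack = {src}, [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def honest_answer(instance):
    """Reference behavior: answer only when entailed, otherwise abstain."""
    if is_entailed(instance["premises"], instance["query"]):
        return "entailed"
    return "abstain"


def make_dataset(n, k, seed=0):
    """Perturb every second instance so half of the data is unanswerable."""
    rng = random.Random(seed)
    return [make_instance(k, rng, perturb=(i % 2 == 1)) for i in range(n)]
```

In this sketch the depth `k` plays the role of the number of reasoning steps, and an honest model is expected to match `honest_answer` on both halves of the data.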

Paper Structure

This paper contains 36 sections, 4 theorems, 19 equations, 4 figures, 5 tables.

Key Result

Proposition 1

Let the GRPO surrogate be defined in eq:grpo_objective. Suppose that in every group there exists a ground-truth rollout $y^\star = (y^\star_1,\dots,y^\star_{|y^\star|})$ whose standardized advantage satisfies $\hat{A}^\star > 0$. Then the policy gradient update $\nabla_\theta \mathcal{J}_{\text{GRPO}}$ […], where the clipped importance factor $\alpha_t(\theta)$ is […]. The proof is given in appn:rq2_methodolo
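
The condition $\hat{A}^\star > 0$ can be illustrated with a small numerical sketch. It assumes the usual group-standardized advantage (reward minus group mean, divided by group standard deviation) and a hypothetical reward scheme of $+1$ for a correct outcome and $-1$ otherwise; the function names and the `eps` stabilizer are illustrative and not taken from the paper.

```python
import statistics


def standardized_advantages(rewards, eps=1e-6):
    """Group-standardized advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def with_injected_ground_truth(sampled_rewards, gt_reward=1.0):
    """Append a ground-truth rollout (assumed to earn `gt_reward`) to the group.

    The last entry of the returned list is the ground-truth advantage.
    """
    return standardized_advantages(list(sampled_rewards) + [gt_reward])


# Early in training every sampled rollout may fail (hypothetical reward -1).
failed_group = [-1.0, -1.0, -1.0, -1.0]

# Without injection the rewards are identical, so every advantage is zero
# and the group carries no learning signal.
plain = standardized_advantages(failed_group)

# With injection the ground-truth rollout sits above the group mean, so its
# advantage is positive while the failed rollouts receive negative ones.
anchored = with_injected_ground_truth(failed_group)
```

Under these assumptions the ground-truth advantage is strictly positive whenever at least one sampled rollout earns less than the ground-truth reward. If every sampled rollout already earns the same reward as the ground truth, the standardized advantage is zero for the whole group.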

Figures (4)

  • Figure 1: Performance of models on (a) answerable (top row) and (b) unanswerable (bottom row) instances in GraphLA, as a function of reasoning depth $k$ and number of variables $|V|$.
  • Figure 2: Performance of models on GraphLI instances as a function of reasoning depth $k$ and number of irrelevant edges $|E_{\text{irr}}|$. Since the task is binary classification, we report overall accuracy.
  • Figure 3: Gradient update statistics on GraphLI during training, comparing Anchor and GRPO. Each subplot reports the gradient norm (left) and upper clipping fraction (right).
  • Figure 4: Validation accuracy of Qwen-2.5-3B-Instruct (left) and Qwen-3-1.7B (right) on GraphLA comparing Easy-to-Hard training with GRPO. In the first stage, models are trained on an easier dataset with either $|V|=5$ or $|V|=8$. In the second stage, the same checkpoints are further trained on the target dataset with $|V|=15$.

Theorems & Definitions (5)

  • Proposition 1
  • Lemma 1: Interchange of gradient and expectation
  • Lemma 2: Log-derivative trick
  • Lemma 3: Subgradient of PPO-style clipping
  • Proof of Proposition 1