Learning to Reason Efficiently with Discounted Reinforcement Learning

Alex Ayoub, Kavosh Asadi, Dale Schuurmans, Csaba Szepesvári, Karim Bouyarmane

TL;DR

The paper tackles the high computational cost of long reasoning traces in large reasoning models by framing verifier-based reasoning as a finite-horizon MDP and discounting reasoning tokens with a discount factor $\gamma<1$. Leveraging Blackwell optimality, the authors show that, within restricted policy classes, there exists a $\gamma_{\mathrm{bw}}<1$ such that for every $\gamma\in(\gamma_{\mathrm{bw}},1)$ the discounted-optimal policies maximize undiscounted accuracy and, among accuracy-maximizers, minimize expected trajectory length. They provide a practical training recipe (discount only environment rewards, apply KL regularization with a moving reference, and match token budgets) and then validate empirically that discounted GRPO achieves Pass@1 parity with undiscounted baselines while significantly shortening reasoning traces across GSM8K, MATH, and other benchmarks. The work offers theoretical guarantees of shortest-successful-path behavior in discounted, deterministic verifier MDPs and demonstrates a viable, deployment-aware path to more efficient reasoning with large language models.
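The selection effect described above can be illustrated with a toy finite policy class. This is a minimal sketch and not the paper's implementation: the `ToyPolicy` class, the three hypothetical policies, their accuracies and token counts, and the convention that a correct answer after `n` tokens earns `gamma**n` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPolicy:
    """Hypothetical policy summarized by its outcome distribution."""
    name: str
    # Each outcome is (probability, num_tokens, is_correct).
    outcomes: tuple


def accuracy(policy):
    """Undiscounted probability that the verifier accepts the answer."""
    return sum(p for p, _, ok in policy.outcomes if ok)


def expected_length(policy):
    """Expected number of generated tokens."""
    return sum(p * n for p, n, _ in policy.outcomes)


def discounted_value(policy, gamma):
    """Expected discounted reward; assumes reward gamma**n on success, 0 otherwise."""
    return sum(p * gamma ** n for p, n, ok in policy.outcomes if ok)


def best_policy(policies, gamma):
    """Policy in the finite class with the largest discounted value."""
    return max(policies, key=lambda pi: discounted_value(pi, gamma))


# Illustrative numbers only; they are not results from the paper.
POLICIES = (
    ToyPolicy("long_accurate", ((0.9, 400, True), (0.1, 400, False))),
    ToyPolicy("short_accurate", ((0.9, 100, True), (0.1, 100, False))),
    ToyPolicy("short_sloppy", ((0.6, 20, True), (0.4, 20, False))),
)

if __name__ == "__main__":
    for gamma in (0.9, 0.999):
        winner = best_policy(POLICIES, gamma)
        print(gamma, winner.name, accuracy(winner), expected_length(winner))
```

In this toy setting, a discount close to 1 picks the shorter of the two most accurate policies, whereas a much stronger discount (here 0.9) trades accuracy for brevity. That is the kind of behavior the threshold $\gamma_{\mathrm{bw}}$ is meant to rule out.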

Abstract

Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens using a discounted reinforcement learning setup (interpretable as a small token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning. Experiments confirm our theoretical results that this approach shortens chains of thought while preserving accuracy.
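
The "small token cost" reading can be made concrete with a first-order expansion. As a sketch, assume a correct answer delivered after $T$ tokens earns discounted reward $\gamma^{T}$ (this exponent convention is an assumption for illustration, not taken from the paper). Then

$$\gamma^{T} = \bigl(1-(1-\gamma)\bigr)^{T} = 1 - (1-\gamma)\,T + O\bigl((1-\gamma)^{2}T^{2}\bigr),$$

so when $(1-\gamma)\,T$ is small, discounting behaves approximately like subtracting a per-token cost of $1-\gamma$ from the success reward.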

Paper Structure

This paper contains 23 sections, 15 theorems, 60 equations, 1 figure, 2 tables.

Key Result

Theorem 3.4

Under Assumption mainass:finite-Pi, there exists $\gamma'\in[0,1)$ and a nonempty set $\Pi^\star_{\mathrm{bw}}\subseteq\Pi$ such that for all $\gamma\in(\gamma',1)$,

Figures (1)

  • Figure 1: GSM8K accuracy (blue, left) and tokens (orange, right) vs. discount $(1-\gamma)$.

Theorems & Definitions (30)

  • Definition 3.1
  • Definition 3.2: blackwell
  • Theorem 3.4
  • Definition 3.5
  • Lemma 3.6
  • proof
  • Lemma 3.7
  • proof
  • Theorem 3.9
  • Theorem 3.10
  • ...and 20 more