Reasoning without Regret

Tarun Chitra

TL;DR

This paper introduces Backwards Adaptive Reward Shaping (BARS), a no-regret framework that converts sparse outcome-based rewards into effective procedure-based signals, and thereby offers a theoretical foundation for the empirical success of DeepSeek's R1.

Abstract

Chain-of-thought reasoning enables large language models to solve multi-step tasks by framing problem solving as a sequential decision problem. Outcome-based rewards, which provide feedback only on final answers, show impressive success but face challenges with credit assignment and slow convergence. In contrast, procedure-based rewards offer efficient step-level feedback but typically require costly human supervision. We introduce Backwards Adaptive Reward Shaping (BARS), a no-regret framework that converts sparse outcome-based rewards into effective procedure-based signals. BARS uses sparse rewards generated from terminal-state priors and cover trees to scale rewards while preventing exploitation. With Bellman contraction and $(\Delta, \epsilon)$-gap rewards, our backward Euler solver achieves $\epsilon$-accuracy in $O\left((R_{\max}/\Delta)\log(1/\epsilon)\right)$ iterations with $O(\log T)$ dynamic regret over $T$ rounds. Our analysis, based on generic chaining, continuous scaling limits, and non-linear Feynman-Kac bounds, connects the empirical successes of recent outcome-based methods with the benefits of intermediate supervision. Together, these results give the first rigorous no-regret algorithm for outcome reward shaping and a theoretical foundation for the empirical success of DeepSeek's R1.
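
The shape of the iteration bound in the abstract can be read off from a standard contraction argument. The derivation below is only a sketch under an assumption the abstract does not state: that each solver iteration shrinks the error $e_k$ by a factor of at most $1 - \Delta/R_{\max}$, with $0 < \Delta \le R_{\max}$. The symbols $e_k$ and $e_0$ (error after $k$ iterations and initial error) are introduced here for illustration only.

```latex
\[
e_k \;\le\; \left(1 - \frac{\Delta}{R_{\max}}\right)^{k} e_0
    \;\le\; \exp\!\left(-\frac{k\,\Delta}{R_{\max}}\right) e_0
    \;\le\; \epsilon
\quad\text{whenever}\quad
k \;\ge\; \frac{R_{\max}}{\Delta}\,\log\frac{e_0}{\epsilon}.
\]
```

Under that assumption, the required number of iterations grows linearly in $R_{\max}/\Delta$ and only logarithmically in $1/\epsilon$, which matches the form $O\left((R_{\max}/\Delta)\log(1/\epsilon)\right)$ quoted above.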

Paper Structure

This paper contains 67 sections, 2 theorems, 63 equations, 2 figures, 2 algorithms.

Key Result

Theorem A.1

Let $\mathcal{T}^{\delta}$ be a family of discrete operators approximating a continuous operator $\overline{\mathcal{T}}$. Suppose: Then, if $u^{\delta}$ is a solution to $\mathcal{T}^{\delta}[u^{\delta}] = 0$, the sequence $\{u^{\delta}\}$ converges uniformly on compact sets to the unique viscosity solution $u$ of $\overline{\mathcal{T}}u = 0$ as $\delta \to 0$.

Figures (2)

  • Figure 1: Flowchart of the results proved in this paper
  • Figure 2: Comparison of forward and backward iteration approaches for sparse rewards. Forward iteration (left) requires extensive exploration with many failed paths before finding optimal paths to rewards. Backward iteration (right) propagates value directly from rewards with stronger gradients and less variance near the goal point, resulting in more efficient computation. The blue dots in the forward panel represent discrete steps of forward Bellman value iteration, while the magenta dots in the backward panel represent steps of backward Euler BSDE iteration. The continuous paths represent their scaling limits as $\delta \to 0$, illustrating the key finding that backward iteration requires $\tau^-_\epsilon = \Theta(\gamma_2(\mathrm{supp}(r),d)^2/\epsilon^2)$ steps compared to forward iteration's $\tau^+_\epsilon = \Theta(\gamma_2(S,d)^2/\epsilon^2)$ steps.
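
The intuition behind the Figure 2 caption can be illustrated on a toy problem. The sketch below is not the paper's backward Euler BSDE solver; it is plain in-place value iteration on a hypothetical deterministic chain with a single reward at the goal, and the names `count_sweeps`, `N_STATES`, and `GAMMA` are assumptions made for this example. It only shows that updating states starting from the reward propagates value in far fewer sweeps than updating them starting from the far end.

```python
def count_sweeps(n_states, gamma, order, tol=1e-12, max_sweeps=10_000):
    """In-place value iteration on a chain where the only action is 'move right'.

    Returns (number of sweeps that changed some value, final values).
    """
    goal = n_states - 1
    values = [0.0] * n_states  # values[goal] stays 0.0: the goal is terminal
    for sweep in range(max_sweeps):
        max_change = 0.0
        for s in order:
            # Sparse reward: 1 only on the transition that enters the goal.
            reward = 1.0 if s + 1 == goal else 0.0
            new_value = reward + gamma * values[s + 1]
            max_change = max(max_change, abs(new_value - values[s]))
            values[s] = new_value  # in-place update, visible later in this sweep
        if max_change <= tol:
            return sweep, values  # 'sweep' earlier sweeps changed something
    raise RuntimeError("value iteration did not converge")


N_STATES, GAMMA = 20, 0.9
forward_order = list(range(N_STATES - 1))        # start far from the reward
backward_order = list(reversed(forward_order))   # start next to the reward

forward_sweeps, forward_values = count_sweeps(N_STATES, GAMMA, forward_order)
backward_sweeps, backward_values = count_sweeps(N_STATES, GAMMA, backward_order)

print(f"forward-order sweeps:  {forward_sweeps}")
print(f"backward-order sweeps: {backward_sweeps}")
```

Both orderings reach the same values; only the amount of work differs. In this toy chain the backward ordering needs a single sweep, while the forward ordering needs one sweep per state between the start and the reward.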

Theorems & Definitions (21)

  • Definition 2.1
  • Claim 2.2
  • Claim 3.1
  • Claim 3.2
  • Claim 4.1: Low Coupling Error between MDP and Controlled Diffusion
  • Claim 4.2: Hitting Time of Forward Iteration
  • Claim 4.3: Hitting Time of Backward Iteration
  • Claim 4.4: Ratio of Hitting Times
  • Claim 4.5: Minimum Reward for Accurate Discretization
  • Claim 4.6: Effective Reward Mass Lower Bound
  • ...and 11 more