Future Policy Approximation for Offline Reinforcement Learning Improves Mathematical Reasoning

Minjae Oh, Yunho Choi, Dongmin Choi, Yohan Jo

Abstract

Reinforcement Learning (RL) has emerged as the key driver for post-training complex reasoning in Large Language Models (LLMs), yet online RL introduces significant instability and computational overhead. Offline RL offers a compelling alternative by decoupling inference from training; however, offline algorithms for reasoning remain under-optimized compared to their online counterparts. A central challenge is gradient entanglement: in long-horizon reasoning trajectories, correct and incorrect solutions share substantial token overlap, causing gradient updates from incorrect trajectories to suppress tokens critical for correct ones. We propose Future Policy Approximation (FPA), a simple method that weights gradients against an estimate of the future policy rather than the current one, enabling proactive gradient reweighting. This future policy is estimated via logit-space extrapolation with negligible overhead. We provide theoretical intuition for FPA through the lens of Optimistic Mirror Descent and further ground it through its connection to DPO. Evaluating FPA across three models and seven mathematical benchmarks, we demonstrate consistent improvements over strong offline baselines including DPO, RPO, KTO, and vanilla offline RL. FPA stabilizes long-horizon training where vanilla objectives degrade and achieves comparable accuracy to online RLVR at a fraction of its GPU hours.
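
The abstract says only that the future policy is estimated by logit-space extrapolation. As a rough illustration, and not the paper's actual formulation, the sketch below assumes one plausible form: the current logits are pushed further along the direction they have moved since an earlier snapshot, scaled by a coefficient `lam`. The function names, the use of an earlier snapshot as the anchor, and the role of `lam` are all assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def extrapolate_logits(current, previous, lam):
    """Hypothetical extrapolation: move `current` further along the
    direction (current - previous), scaled by `lam`.
    This exact form is an assumption, not taken from the paper."""
    return [c + lam * (c - p) for c, p in zip(current, previous)]


# Toy three-token vocabulary; token 0 has been pushed down since the snapshot.
previous_logits = [2.0, 1.0, 0.0]
current_logits = [1.5, 1.2, 0.0]

pi = softmax(current_logits)  # current policy
pi_hat = softmax(extrapolate_logits(current_logits, previous_logits, lam=1.0))

print("current policy:     ", [round(p, 3) for p in pi])
print("extrapolated policy:", [round(p, 3) for p in pi_hat])
```

Under this assumed form, `lam = 0` gives back the current policy, and a token whose logit is already falling gets an even lower probability under the extrapolated policy. A weight computed from that lower probability would shrink before the real policy gets there, which matches the abstract's description of proactive reweighting.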

Paper Structure

This paper contains 50 sections, 1 theorem, 41 equations, 8 figures, 6 tables.

Key Result

Proposition A.1

The gradient of SimPER's objective equals the gradient of the Off-RL objective.

Figures (8)

  • Figure 1: Overview of FPA. (Left) Conceptual overview of policy training. Naive updates are reactive, responding to over-penalization only after it occurs. FPA proactively estimates the future policy, anticipating and avoiding over-penalization beforehand. (Right) On Llama-3.2-3B-Instruct, naive offline RL suffers performance collapse due to over-penalization, whereas FPA ensures stable training.
  • Figure 2: Training dynamics. (Left) Log-probability difference $\log \pi_\theta (y\mid x) - \log\pi_\text{ref}(y\mid x)$ during training for correct and incorrect trajectories under Off-RL and FPA. (Center) The policy probability used for gradients during training: $\pi$ for offline RL and $\hat{\pi}$ for FPA. (Right) The ratio between $\hat{\pi}$ from FPA and $\pi$ from offline RL for incorrect trajectories during training.
  • Figure 3: Training dynamics under prolonged training. (Left) Log-probability difference $\log\pi_\theta(y\mid x) - \log\pi_\text{ref}(y\mid x)$ for correct and incorrect trajectories under Off-RL and FPA. (Center) Accuracy on MATH500 throughout training. (Right) The number of times $\pi$ and $\hat{\pi}$ fall below 0.2, indicating strong incorrect gradient regularization.
  • Figure 4: Ablation on targeted FPA. (Left) Log-probability difference $\log\pi_\theta(y \mid x) - \log\pi_\text{ref}(y\mid x)$ for w/ Cor. Only and w/ Incor. Only. (Center) Accuracy on MATH500 throughout training. (Right) Gradient norm during training.
  • Figure 5: $\lambda$ sensitivity. MATH500 and average accuracy of FPA across $\lambda \in \{0, 0.5, 1, 2, 5, 10\}$.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Proposition A.1: SimPER aligns with Off-RL
  • proof