Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Ziyan Wang, Zheng Wang, Jie Fu, Xingwei Qu, Qi Cheng, Shengpu Tang, Minjia Zhang, Xiaoming Huo

TL;DR

SFPO tackles instability in on-policy reasoning RL for LLMs by restructuring updates into a fast trajectory, a reposition step, and a slow correction, while keeping the original objective and rollout process intact. The method introduces an adaptive alpha schedule to balance exploitation of stabilized gradients with stability, yielding consistent improvements over GRPO in math reasoning benchmarks and substantial reductions in rollout requirements and training time. Theoretical intuition explains how the three stages mitigate variance and drift, and empirical results demonstrate faster convergence, improved stability, and lower resource costs across multiple models and datasets. SFPO offers a practical, drop-in enhancement for LLM reasoning pipelines with potential for curriculum or meta-learning extensions.

Abstract

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average on math reasoning benchmarks. It also needs up to 4.93× fewer rollouts and up to 4.19× less wall-clock time to match GRPO's best accuracy.

Paper Structure

This paper contains 32 sections, 12 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: Pipeline of SFPO at step $s$. Starting from the current policy $\pi_{\theta^{s,0}}$, we first generate rollouts for training. Stage I (Fast Trajectory): apply $K$ successive gradient updates on the same batch to obtain $\theta^{s,K}$. Stage II (Reposition): interpolate between $\theta^{s,K}$ and the starting point $\theta^{s,0}$ to form $\widetilde{\theta}^{s,K}$, controlling off-policy drift. Stage III (Slow Correction): perform one additional update on $\widetilde{\theta}^{s,K}$, yielding $\pi_{\theta^{s+1,0}}$ for the next step.
  • Figure 2: Average validation accuracy of different base models throughout the training process.
  • Figure 3: Training dynamics for DeepSeek-R1-Distilled-Qwen-7B, comparing GRPO and SFPO across response length, entropy loss, and reward.
  • Figure 4: Comparison of GRPO and SFPO. (a) Number of rollouts required to achieve the best accuracy of GRPO. (b) Corresponding training time.
  • Figure 5: Average training accuracy of different settings throughout the training process. (a) Small $K=3$ with varying values of $\alpha$. (b) Large $K=7$ with varying values of $\alpha$. (c) Varying values of $K$ with fixed $\alpha=0.8$. (d) Impact of the existence of Stage III.
  • ...and 7 more figures
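
The three stages in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes plain gradient descent on a toy quadratic loss in place of the GRPO objective, and it assumes the reposition step is a linear interpolation $\widetilde{\theta}^{s,K} = \theta^{s,0} + \alpha\,(\theta^{s,K} - \theta^{s,0})$, a form the caption does not spell out. The names `sfpo_step`, `quadratic_grad`, and `lr` are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def quadratic_grad(theta: Vector) -> Vector:
    """Gradient of the toy loss 0.5 * ||theta||^2 (stand-in for a policy loss)."""
    return list(theta)


def sfpo_step(
    theta0: Vector,
    grad_fn: Callable[[Vector], Vector],
    lr: float,
    k: int,
    alpha: float,
) -> Vector:
    """One SFPO-style outer step on a fixed batch (illustrative only)."""
    # Stage I (fast trajectory): k successive gradient updates on the same batch.
    theta = list(theta0)
    for _ in range(k):
        grad = grad_fn(theta)
        theta = [t - lr * g for t, g in zip(theta, grad)]

    # Stage II (reposition): pull the fast iterate back toward the start.
    # ASSUMPTION: linear interpolation; alpha=1 keeps the fast iterate,
    # alpha=0 returns to the starting point theta0.
    theta_tilde = [t0 + alpha * (tk - t0) for t0, tk in zip(theta0, theta)]

    # Stage III (slow correction): one additional update at the repositioned point.
    grad = grad_fn(theta_tilde)
    return [t - lr * g for t, g in zip(theta_tilde, grad)]


if __name__ == "__main__":
    start = [1.0, -2.0]
    print(sfpo_step(start, quadratic_grad, lr=0.1, k=3, alpha=0.5))
```

In this sketch, setting `alpha=0` discards the fast trajectory entirely, so the step reduces to a single ordinary gradient update from `theta0`. Larger values of `alpha` keep more of the progress made during the `k` inner steps.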