Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Ziyan Wang; Zheng Wang; Jie Fu; Xingwei Qu; Qi Cheng; Shengpu Tang; Minjia Zhang; Xiaoming Huo

TL;DR

SFPO tackles instability in on-policy reasoning RL for LLMs by restructuring each update into a fast trajectory, a reposition step, and a slow correction, while keeping the original objective and rollout process intact. The method also introduces an adaptive alpha schedule that trades off exploiting the stabilized gradients against training stability. It yields consistent improvements over GRPO on math reasoning benchmarks, along with substantial reductions in rollout requirements and training time. Theoretical intuition explains how the three stages mitigate variance and drift, and empirical results show faster convergence, improved stability, and lower resource costs across multiple models and datasets. SFPO is a practical, drop-in enhancement for LLM reasoning pipelines, with potential for curriculum or meta-learning extensions.

Abstract

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average on math reasoning benchmarks. It also requires up to 4.93× fewer rollouts and up to 4.19× less wall-clock time to match GRPO's best accuracy.
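The three stages named in the abstract can be illustrated with a toy sketch. The abstract does not give the update equations, so everything below is an assumption made for illustration: the function name `sfpo_style_step`, the plain gradient-descent inner steps, the linear interpolation used for the reposition step, the fixed `alpha`, and the quadratic stand-in for a policy loss are all hypothetical and should not be read as the authors' implementation.

```python
from typing import Callable, List

Vector = List[float]


def sfpo_style_step(
    theta: Vector,
    grad_fn: Callable[[Vector], Vector],
    lr: float = 0.1,
    inner_steps: int = 3,
    alpha: float = 0.5,
) -> Vector:
    """One illustrative three-stage update on a fixed batch (hypothetical names)."""
    start = list(theta)

    # Stage 1 (fast trajectory): several gradient steps on the same batch.
    fast = list(start)
    for _ in range(inner_steps):
        grad = grad_fn(fast)
        fast = [p - lr * g for p, g in zip(fast, grad)]

    # Stage 2 (reposition): pull the fast endpoint back toward the start.
    # ASSUMPTION: linear interpolation with a fixed weight alpha in [0, 1].
    repositioned = [s + alpha * (f - s) for s, f in zip(start, fast)]

    # Stage 3 (slow correction): one more gradient step from the repositioned point.
    grad = grad_fn(repositioned)
    return [p - lr * g for p, g in zip(repositioned, grad)]


def quadratic_grad(theta: Vector) -> Vector:
    """Gradient of the toy loss 0.5 * ||theta||^2, standing in for a policy loss."""
    return list(theta)


theta0 = [1.0, -2.0]
theta1 = sfpo_style_step(theta0, quadratic_grad)
```

Under these assumptions, setting `alpha=0.0` discards the fast trajectory and reduces the step to a single ordinary gradient update, while `alpha=1.0` keeps the full fast trajectory before the final correction. In an actual reasoning-RL pipeline, `grad_fn` would be the gradient of the unchanged policy objective on the current rollout batch, and the optimizer would typically not be plain gradient descent.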
