Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

Luke Huang, Zhuoyang Zhang, Qinghao Hu, Shang Yang, Song Han

TL;DR

VCPO substantially improves robustness for asynchronous training across math, general reasoning, and tool-use tasks, outperforming a broad suite of baselines spanning masking/clipping stabilizers and algorithmic variants.

Abstract

Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks, and asynchronous RL training is attractive because it increases end-to-end throughput. However, for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly $\textbf{higher variance}$: training on stale rollouts creates heavy-tailed importance ratios, causing a small fraction of samples to dominate updates. This amplification makes gradients noisy and learning unstable relative to matched on-policy training. Across math and general reasoning benchmarks, we find collapse is reliably predicted by effective sample size (ESS) and unstable gradient norms. Motivated by this diagnosis, we propose $\textbf{V}$ariance $\textbf{C}$ontrolled $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{VCPO}$), a general stabilization method for REINFORCE/GRPO-style algorithms that (i) scales learning rate based on effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-policy setting, avoiding an auxiliary value model and adding minimal overhead. Empirically, VCPO substantially improves robustness for asynchronous training across math, general reasoning, and tool-use tasks, outperforming a broad suite of baselines spanning masking/clipping stabilizers and algorithmic variants. This reduces long-context, multi-turn training time by 2.5$\times$ while matching synchronous performance, demonstrating that explicit control of policy-gradient variance is key for reliable asynchronous RL at scale.
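
The effective sample size (ESS) diagnostic mentioned in the abstract can be illustrated with a small sketch. It uses the standard definition ESS = (sum of weights)^2 / (sum of squared weights) over importance ratios. The function names and the linear scaling rule in `ess_scaled_lr` are assumptions made for illustration only; the abstract does not state the exact rule VCPO uses.

```python
import math

def importance_ratios(logp_new, logp_old):
    """Per-sample importance ratios pi_new / pi_old from sequence log-probs."""
    return [math.exp(a - b) for a, b in zip(logp_new, logp_old)]

def effective_sample_size(weights):
    """Standard ESS: (sum w)^2 / sum(w^2). Equals N when all weights are equal."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0.0:
        return 0.0
    return total * total / total_sq

def ess_scaled_lr(base_lr, weights):
    """Hypothetical rule: shrink the learning rate by the ESS ratio (ESS / N).

    This linear scaling is an assumption for illustration, not the paper's
    stated formula.
    """
    n = len(weights)
    if n == 0:
        return 0.0
    return base_lr * effective_sample_size(weights) / n

# Fresh rollouts: ratios near 1, so ESS is close to the batch size.
fresh = importance_ratios([-1.0, -2.0, -3.0, -4.0], [-1.0, -2.0, -3.0, -4.0])
# Stale rollouts: one heavy-tailed ratio dominates and ESS drops toward 1.
stale = importance_ratios([-1.0, -2.0, -3.0, 0.0], [-1.0, -2.0, -3.0, -6.0])

print(effective_sample_size(fresh), effective_sample_size(stale))
print(ess_scaled_lr(1e-6, fresh), ess_scaled_lr(1e-6, stale))
```

When a single ratio dominates, the ESS ratio falls well below 1, so a rule of this kind damps exactly the updates that rest on few effective samples.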

Paper Structure (49 sections, 37 equations, 14 figures, 5 tables, 1 algorithm)

Figures (14)

  • Figure 1: Long-context, tool-integrated multi-turn RL comparing synchronous training ($k{=}0$) to a two-step policy lag ($k{=}2$) with VCPO versus sequence-level truncated importance sampling (TIS). Top: AIME-2025 validation accuracy vs. cumulative wall-clock time. VCPO matches the best synchronous accuracy $2.5\times$ faster ($\approx$42h vs. $\approx$105h) and continues to improve thereafter. Bottom: Gradient norm vs. training steps. The TIS run shows an instability characterized by a brief gradient-norm spike followed by rapid collapse.
  • Figure 2: Sequence-Level TIS Collapse. Qwen2.5-7B Base trained on the MATH task 10 steps off-policy (PipelineRL-10). The ESS ratio first degrades and then collapses, leading to a spike in rollout-policy KL divergence and a sharp drop in both training reward and validation accuracy. See Appendix \ref{app:math_hyperparameters} for hyperparameters and training details.
  • Figure 3: Compute (left) and memory (right) overhead of baseline-aware updates for Qwen2.5-7B on 4$\times$H100 GPUs (TP=4) with a sequence length of 8192 tokens.
  • Figure 4: GSM8K with Qwen2-1.5B under PipelineRL-12 (high policy lag). Most baselines collapse during training (or crash outright; e.g., Geometric MIS masks all sequences and produces no loss), while VCPO remains stable throughout training and matches synchronous performance. Training details and hyperparameters can be found in Appendix \ref{app:gsm8k_hyperparameters}.
  • Figure 5: Qwen2.5-7B under PipelineRL-10 (10 steps off-policy) on Countdown and MATH-500. Across both tasks, sequence-level truncated importance sampling (TIS) suffers ESS-ratio collapse followed by KL/gradient instability and degraded accuracy, whereas VCPO maintains a healthy ESS and stable updates, reaching synchronous performance. Training details and hyperparameters are provided in Appendix \ref{app:math_hyperparameters}.
  • ...and 9 more figures
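
The sequence-level truncated importance sampling (TIS) baseline named in the captions above can be sketched as follows. This is a generic illustration, assuming the common form in which the sequence ratio is the exponentiated sum of per-token log-probability differences, capped at a threshold. The function names and the cap value `2.0` are hypothetical and not taken from the paper.

```python
import math

def sequence_ratio(token_logp_new, token_logp_old):
    """Sequence-level importance ratio: exp of the summed per-token log-ratio."""
    log_ratio = sum(a - b for a, b in zip(token_logp_new, token_logp_old))
    return math.exp(log_ratio)

def truncated_ratio(ratio, cap=2.0):
    """Truncate the ratio from above; `cap` is an illustrative value."""
    return min(ratio, cap)

# A rollout whose tokens became much more likely under the current policy
# yields a very large raw ratio, which truncation caps.
raw = sequence_ratio([-0.5, -0.5, -0.5], [-2.0, -2.0, -2.0])
print(raw, truncated_ratio(raw))
```

Truncation bounds the largest weight but does not by itself keep the ESS ratio from degrading, which is consistent with the collapse pattern the captions describe for TIS runs.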