VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu

TL;DR

The paper tackles instability in off-policy reinforcement learning for large language models, which arises when the behavior policy that generated the samples drifts away from the policy being trained (through policy staleness, asynchronous training, or mismatches between training and inference engines). It introduces VESPO, a variational, measure-change-based framework that yields a closed-form, smooth kernel for reshaping sequence-level importance weights without length normalization, avoiding the bias-variance trade-offs of prior methods. The resulting kernel $\phi(W) = W^{\alpha} e^{-\lambda W}$ (implemented in the shifted form $\phi(W)=W^{c_1} e^{c_2(1-W)}$) provides principled variance control and preserves sequence-level credit assignment, with asymmetric handling of positive and negative advantages. On mathematical reasoning benchmarks, VESPO trains stably under up to $64\times$ policy staleness and under fully asynchronous execution, outperforms GRPO, GSPO, and SAPO across dense and MoE models, and remains compatible with techniques such as R2 and TIS. The approach offers practical robustness for large-scale off-policy LLM training and points to future scaling and multi-turn extensions.
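To make the shifted kernel concrete, here is a minimal Python sketch that evaluates $\phi(W)=W^{c_1} e^{c_2(1-W)}$ for a single sequence-level weight. The function name `vespo_kernel` and the coefficient values used below are illustrative assumptions, not taken from the paper or its repository.

```python
import math


def vespo_kernel(w: float, c1: float, c2: float) -> float:
    """Evaluate the shifted kernel phi(W) = W**c1 * exp(c2 * (1 - W)).

    w  : sequence-level importance weight, must be non-negative
    c1 : exponent on W (hypothetical value chosen by the caller)
    c2 : decay rate on W (hypothetical value chosen by the caller)
    """
    if w < 0.0:
        raise ValueError("importance weights must be non-negative")
    return (w ** c1) * math.exp(c2 * (1.0 - w))


if __name__ == "__main__":
    # Illustrative coefficients only; the paper's settings are not reproduced here.
    for w in (0.25, 0.5, 1.0, 2.0, 4.0):
        print(f"W={w:<5} phi={vespo_kernel(w, c1=2.0, c2=2.0):.4f}")
```

Algebraically, the shifted form equals $e^{c_2} W^{c_1} e^{-c_2 W}$, so it has the shape $W^{\alpha} e^{-\lambda W}$ up to a constant factor and is normalized so that $\phi(1)=1$ for any coefficients. Weights far from 1 in either direction are damped smoothly rather than cut off.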

Abstract

Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, and delivers consistent gains across both dense and Mixture-of-Experts models. Code is available at https://github.com/FloyedShen/VESPO
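As a rough illustration of what "sequence-level importance weights without length normalization" means, the sketch below computes the weight of one response from per-token log-probabilities under the current and behavior policies. The function names and the toy numbers are assumptions made for this example. The length-normalized variant (the geometric mean of the token ratios) is included only for contrast.

```python
import math
from typing import Sequence


def sequence_log_weight(logp_current: Sequence[float],
                        logp_behavior: Sequence[float]) -> float:
    """Sum of per-token log-ratios: log pi(y|x) - log mu(y|x)."""
    if len(logp_current) != len(logp_behavior):
        raise ValueError("token log-prob sequences must have equal length")
    if len(logp_current) == 0:
        raise ValueError("empty response")
    return sum(c - b for c, b in zip(logp_current, logp_behavior))


def sequence_weight(logp_current: Sequence[float],
                    logp_behavior: Sequence[float]) -> float:
    """Unnormalized sequence-level weight: the product of token ratios."""
    return math.exp(sequence_log_weight(logp_current, logp_behavior))


def length_normalized_weight(logp_current: Sequence[float],
                             logp_behavior: Sequence[float]) -> float:
    """Geometric mean of token ratios, shown only for comparison."""
    log_w = sequence_log_weight(logp_current, logp_behavior)
    return math.exp(log_w / len(logp_current))


if __name__ == "__main__":
    # Toy per-token log-probabilities for a two-token response.
    cur = [-1.0, -2.0]
    beh = [-1.5, -2.5]
    print(sequence_weight(cur, beh), length_normalized_weight(cur, beh))
```

Summing in log space before exponentiating avoids underflow for long responses. The two variants coincide only for single-token responses, which is why the choice matters for long reasoning traces.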

Paper Structure (29 sections, 2 theorems, 40 equations, 11 figures, 3 tables)

Key Result

Proposition 2.1

The solution to the constrained problem has a closed form, in which $\lambda \geq 0$ is the Lagrange multiplier for the moment constraint and $Z$ is the normalization constant.

Figures (11)

  • Figure 1: Left: VESPO reformulates IS weight reshaping as finding a proposal $Q^*$ that balances proximity to $\mu$ and $\pi$ under a variance constraint. Right: Training reward (gbs/mbs=4) on Qwen3-30B-A3B-Base.
  • Figure 2: Surrogate objectives $f(w)$ (top) and gradient scaling factors $\phi(w) = w \cdot f'(w)$ (bottom) for positive and negative advantages. Hard clipping zeros $\phi$ abruptly at the boundary; VESPO peaks near $w{=}1$ and decays smoothly.
  • Figure 3: Training reward across staleness levels ($N \in \{4, 8, 16, 32, 64\}$) on Qwen3-30B-A3B-Base. Each panel shows one method with different $N$ values. VESPO maintains stable, consistent training curves across all staleness levels.
  • Figure 4: Training dynamics across staleness levels ($N$ = gbs/mbs $\in \{4, 8, 16, 32, 64\}$) on Qwen3-30B-A3B-Base. Each row corresponds to a different $N$; columns show training reward, AIME25 accuracy, response length, KL divergence, entropy, and PG loss. VESPO (red) maintains stable training across all conditions, while baselines exhibit characteristic failure modes.
  • Figure 5: Training dynamics under fully asynchronous training on Qwen3-30B-A3B-Base. VESPO maintains stable training and achieves the highest reward and benchmark accuracy.
  • ...and 6 more figures
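
The relation $\phi(w) = w \cdot f'(w)$ from the Figure 2 caption can be worked through for two cases. This is a sketch under stated assumptions: the clipping threshold $\epsilon$ and the PPO-style clipped surrogate for a positive advantage are illustrative, and the second case uses the shifted kernel form $w^{c_1} e^{c_2(1-w)}$ with $c_1, c_2 > 0$ left unspecified.

```latex
% Case 1: hard clipping, positive advantage (assumed PPO-style surrogate)
f(w) = \min(w,\, 1+\epsilon)
\;\Longrightarrow\;
\phi(w) = w \, f'(w) =
\begin{cases}
  w, & w < 1+\epsilon, \\
  0, & w > 1+\epsilon .
\end{cases}

% Case 2: smooth kernel of the shifted form
\phi(w) = w^{c_1} e^{c_2 (1-w)}, \qquad
\frac{d}{dw}\log\phi(w) = \frac{c_1}{w} - c_2 = 0
\;\Longrightarrow\;
w^{*} = \frac{c_1}{c_2}, \qquad \phi(1) = 1 .
```

In the first case $\phi$ drops from $1+\epsilon$ to zero at the boundary, which is the abrupt cutoff the caption describes. In the second, $\phi$ has a single maximum at $w^{*} = c_1/c_2$ and decays smoothly on both sides, so a peak near $w = 1$ corresponds to $c_1 \approx c_2$ under these assumptions.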

Theorems & Definitions (4)

  • Remark 1.1: When Token-Level Approximation is Reasonable
  • Proposition 2.1: Solution to the Constrained Problem
  • proof
  • Proposition 3.1: Conflation under Length Normalization