ASPO: Asymmetric Importance Sampling Policy Optimization

Jiakang Wang, Runze Liu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, Kun Gai

TL;DR

This work identifies a fundamental flaw in GRPO-based Outcome-Supervised RL (OSRL) for LLMs: token-level clipping misallocates learning weights, overweighting positive-advantage tokens that already have high probability and causing entropy collapse. The authors propose ASPO, which flips the Importance Sampling (IS) ratios of positive-advantage tokens and adds a soft dual-clipping mechanism to stabilize extreme updates while preserving gradient flow. On coding and mathematical reasoning benchmarks, ASPO reduces premature convergence, smooths training dynamics, and improves final performance over strong GRPO-based baselines. The approach treats token weighting as a core learning signal in OSRL, and its code and models are publicly available.

Abstract

Recent Large Language Model (LLM) post-training methods rely on token-level clipping mechanisms during Reinforcement Learning (RL). However, we identify a fundamental flaw in this Outcome-Supervised RL (OSRL) paradigm: the Importance Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to unbalanced token weighting for positive and negative tokens. This mismatch suppresses the update of low-probability tokens while over-amplifying already high-probability ones. To address this, we propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy that flips the IS ratios of positive-advantage tokens, aligning their update direction with the learning dynamics of negative ones. AIS further incorporates a soft dual-clipping mechanism to stabilize extreme updates while maintaining gradient flow. Comprehensive experiments on coding and mathematical reasoning benchmarks demonstrate that ASPO significantly mitigates premature convergence, improves training stability, and enhances final performance over strong GRPO-based baselines. Our analysis provides new insights into the role of token-level weighting in OSRL and highlights the critical importance of correcting IS in LLM RL. The code and models of ASPO are available at https://github.com/wizard-III/Archer2.0.
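
The weighting asymmetry described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation. It assumes only that, in a PPO/GRPO-style surrogate, the gradient of an unclipped token is scaled by its IS ratio (new probability divided by old probability), and it reads "flipping" as taking the reciprocal of that ratio for positive-advantage tokens. The function name `token_weight` and the flag `flip_positive` are hypothetical, and the soft dual-clipping step is deliberately not modeled.

```python
def token_weight(p_new, p_old, advantage, flip_positive=False):
    """Scale applied to advantage * grad(log p_new) for one unclipped token.

    Assumption: the standard weight is the IS ratio p_new / p_old, and
    "flipping" means using its reciprocal for positive-advantage tokens.
    """
    ratio = p_new / p_old              # standard importance sampling ratio
    if flip_positive and advantage > 0:
        ratio = 1.0 / ratio            # assumed flip: reciprocal of the IS ratio
    return ratio


# Two positive-advantage tokens (illustrative numbers, not data from the paper):
# one whose probability has dropped since sampling, one whose has risen.
low = dict(p_new=0.05, p_old=0.10, advantage=1.0)
high = dict(p_new=0.90, p_old=0.60, advantage=1.0)

for name, tok in (("low-prob", low), ("high-prob", high)):
    standard = token_weight(**tok)
    flipped = token_weight(**tok, flip_positive=True)
    print(f"{name}: standard={standard:.3f} flipped={flipped:.3f}")
```

Under the standard weighting in this toy setup, the already likely token receives the larger weight (about 1.5 versus 0.5), which matches the imbalance the abstract describes. With the assumed flip the ordering reverses, while negative-advantage tokens keep their usual ratio.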

Paper Structure

This paper contains 46 sections, 6 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Comparison of test accuracy and training dynamics between DAPO and DAPO without IS.
  • Figure 2: Curves of response-level IS ratios throughout DAPO training. The average IS ratios are shown in gray dashed lines.
  • Figure 3: 3D and 2D visualization of IS weights in PPO-Clip.
  • Figure 4: Comparison of DAPO and DAPO with positive samples using response-level IS weights.
  • Figure 5: Comparison among DAPO, DAPO with response-level IS weights for positive samples, and ASPO.