ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Xiaoxuan Wang; Han Zhang; Haixin Wang; Yidan Shi; Ruoyan Li; Kaiqiao Han; Chenyi Tong; Haoran Deng; Renliang Sun; Alexander Taylor; Yanqiao Zhu; Jason Cong; Yizhou Sun; Wei Wang

TL;DR

This paper proposes ARLArena, a stable training recipe and systematic analysis framework for studying training stability in a controlled, reproducible setting, and introduces SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in agentic reinforcement learning (ARL).

Abstract

Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable and often ends in training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled and reproducible setting. ARLArena first constructs a clean and standardized testbed. We then decompose the policy gradient into four core design dimensions and assess the performance and stability of each dimension. Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy gradient perspective for ARL and offers practical guidance for building stable and reproducible LLM-based agent training pipelines.

Paper Structure (94 sections, 33 equations, 11 figures, 8 tables)

Introduction
Problem Formulation
Policy Gradient for Agentic RL
Agentic RL.
Policy Gradient Decomposition Dimensions
Loss Aggregation.
IS Clipping.
Trajectory Filtering and Resampling.
Advantage Design.
Experimental Setup
Standardized Testbed
(1) Behavior Cloning.
(2) Format Penalty.
(3) Auxiliary KL Loss.
(4) PO-specific Hyper-parameter Grid Search.
...and 79 more sections

Figures (11)

Figure 1: Overview of ARLArena. Part 1: A standardized testbed via behavior cloning, format penalty, KL regularization, and hyperparameter search. Part 2: Policy gradient decomposition into four dimensions with representative methods mapped to each. Part 3: Key findings on training stability and collapse modes. Part 4: Insights unified into SAMPO for stable ARL training.
Figure 2: Training curves on ALFWorld (left) and Sokoban (right). SAMPO (ours) achieves the highest success rates on both environments with stable, monotonic improvement throughout training, while baseline methods exhibit varying degrees of instability. These results demonstrate that principled integration of sequence-level clipping, advantage design, and dynamic filtering, as combined in SAMPO, is critical for both training stability and final performance in multi-turn agentic RL.
Figure 3: Training dynamics of six IS variants on ALFWorld: GRPO, GSPO, SAPO, CISPO, and their sequence-masked counterparts $\text{SAPO}_{\texttt{SM}}$ and $\text{CISPO}_{\texttt{SM}}$. Panels show (from left to right) success rate, off-policy KL divergence between the current and behavior policies, KL loss between the current and reference policies, gradient norm, and valid-format ratio of rollout actions.
Figure 4: Token-level and sequence-level IS analysis of SAPO and its sequence-masked variant $\text{SAPO}_{\texttt{SM}}$. (a, b) Fraction of tokens with importance ratios outside the clipping range, decomposed into lower-bound (negative advantage) and upper-bound (positive advantage) portions. (c, d) Rollout groups partitioned by advantage sign, entropy level, and IS ratio magnitude, with KL divergence normalized for relative comparison.
Figure 5: Sequence-level IS analysis of CISPO and $\text{CISPO}_{\texttt{SM}}$ (CISPO with sequence masking) on ALFWorld.
...and 6 more figures
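
The diagnostic described in the Figure 4 caption (the fraction of tokens whose importance ratio falls outside the clipping range, split by advantage sign) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the symmetric default thresholds of 0.2, and the length-normalized form of the sequence-level ratio are all assumptions made for illustration.

```python
import math
from typing import Dict, List


def token_ratios(logp_new: List[float], logp_old: List[float]) -> List[float]:
    """Per-token importance ratios pi_new / pi_old, computed from log-probs."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def sequence_ratio(logp_new: List[float], logp_old: List[float]) -> float:
    """Length-normalized sequence-level ratio (assumed form):
    the geometric mean of the per-token ratios."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


def clip_fractions(
    ratios: List[float],
    advantages: List[float],
    eps_low: float = 0.2,   # hypothetical lower clipping width
    eps_high: float = 0.2,  # hypothetical upper clipping width
) -> Dict[str, float]:
    """Fraction of tokens outside the clipping range, split into a
    lower-bound part (negative advantage, ratio < 1 - eps_low) and an
    upper-bound part (positive advantage, ratio > 1 + eps_high)."""
    total = len(ratios)
    lower = sum(
        1 for r, a in zip(ratios, advantages) if a < 0 and r < 1.0 - eps_low
    )
    upper = sum(
        1 for r, a in zip(ratios, advantages) if a > 0 and r > 1.0 + eps_high
    )
    return {"lower": lower / total, "upper": upper / total}
```

Note that in this sketch a sequence can contain clipped tokens in both directions while its length-normalized sequence-level ratio stays close to 1, which is one reason token-level and sequence-level views of the same rollout can disagree.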
