SeeUPO: Sequence-Level Agentic-RL with Convergence Guarantees

Tianyi Hu, Qingxu Fu, Yanxi Chen, Zhaoyang Liu, Bolin Ding

TL;DR

This work addresses the lack of convergence guarantees for backbone RL algorithms in multi-turn, agentic RL settings used to train LLM-based agents. It analyzes combinations of advantage estimation and policy updates, revealing a fundamental trade-off between critic-free operation and convergence guarantees, and proposes SeeUPO, a sequence-level framework that reframes multi-turn interactions as sequential multi-agent bandits with reverse-order updates to enable backward induction and global optimality. SeeUPO inherits monotonic improvement from the HARL/HAML framework and demonstrates substantial empirical gains (e.g., 43.3%–54.6% on Qwen3-14B and 24.1%–41.9% on Qwen2.5-14B) with improved training stability on AppWorld and BFCL v4. These results show that carefully designed turn-wise credit assignment and update ordering can achieve both stability and optimality in complex multi-turn agentic tasks, offering practical impact for deploying robust autonomous AI agents.

Abstract

Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)-based AI agents. However, existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios, especially in multi-turn settings, which can lead to training instability and failure to converge to optimal policies. In this paper, we systematically analyze how different combinations of policy update mechanisms and advantage estimation methods affect convergence properties in single-turn and multi-turn scenarios. We find that REINFORCE with Group Relative Advantage Estimation (GRAE) can converge to the global optimum under undiscounted conditions, but that combining PPO with GRAE breaks PPO's original monotonic improvement property. Furthermore, we demonstrate that mainstream backbone RL algorithms cannot simultaneously achieve both critic-free operation and convergence guarantees in multi-turn scenarios. To address this, we propose SeeUPO (Sequence-level Sequential Update Policy Optimization), a critic-free approach with convergence guarantees for multi-turn interactions. SeeUPO models multi-turn interaction as sequentially executed multi-agent bandit problems. Through turn-by-turn sequential policy updates in reverse execution order, it ensures monotonic improvement and convergence to the globally optimal solution via backward induction. Experiments on AppWorld and BFCL v4 demonstrate SeeUPO's substantial improvements over existing backbone algorithms: relative gains of 43.3%-54.6% on Qwen3-14B and 24.1%-41.9% on Qwen2.5-14B (averaged across benchmarks), along with superior training stability.
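
The reverse-order idea in the abstract can be illustrated with a minimal sketch of exact backward induction on a toy two-turn problem. This is not the SeeUPO update itself (which is a policy-optimization procedure); the reward table, the action set, and the function name `backward_induction` are assumptions made purely for illustration. The sketch only shows why solving the last turn first, then earlier turns given the later choices, recovers the globally optimal sequence.

```python
from itertools import product

# Hypothetical toy problem: 2 turns, 2 actions per turn, reward only at the end.
REWARD = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
ACTIONS = (0, 1)


def backward_induction(reward, actions, horizon):
    """Return the best action for every history prefix, solving the last turn first."""
    policy = {}           # maps a prefix (tuple of earlier actions) to the best next action
    value = dict(reward)  # the value of a complete history is simply its reward
    for turn in reversed(range(horizon)):
        next_value = {}
        # Enumerate every possible history of length `turn`.
        for prefix in product(actions, repeat=turn):
            # Later turns are already solved, so `value` holds their optimal returns.
            best = max(actions, key=lambda a: value[prefix + (a,)])
            policy[prefix] = best
            next_value[prefix] = value[prefix + (best,)]
        value = next_value
    return policy, value[()]


policy, optimal_return = backward_induction(REWARD, ACTIONS, horizon=2)
print(policy, optimal_return)
```

In this toy table, a first-turn choice that looks good only under a fixed, suboptimal second-turn choice (action 0 followed by action 0) is avoided, because the second turn is optimized before the first turn is evaluated.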

Paper Structure (75 sections, 18 theorems, 115 equations, 5 figures, 5 tables)

Key Result

theorem 1

Monotonic Improvement under HAML (adapted from the Fundamental Theorem of HAML [HARL]). Let $\mathcal{N} = \{1, \ldots, n\}$ be the set of agents. For each agent $i \in \mathcal{N}$, let $\mathfrak{D}^i$ be a heterogeneous-agent drift functional satisfying the nonnegativity and zero gradi... where $\beta_{\boldsymbol{\pi}} \in \mathcal{P}(\mathcal{S})$ is a positive sampling distribution.

Figures (5)

  • Figure 1: Performance comparison of training the Qwen3-14B model on the AppWorld and BFCL-v4 benchmarks. (a)-(b) show training curves, (c)-(f) show test curves. SeeUPO shows significantly stronger training stability and the best performance among the compared backbone RL algorithms.
  • Figure 2: The core idea of SeeUPO is to abstract multi-turn interaction tasks into sequential-decision multi-agent single-turn tasks, and to adopt reverse-order sequential updates to achieve global optimality via backward induction. The figure shows an example scenario with three turns; from left to right: the original task scenario, the multi-agent modeling of the scenario, and the reverse update mechanism based on MARL theory.
  • Figure 3: The batch construction approach of SeeUPO. Unlike methods that construct batches from entire trajectories or by concatenating sliced turns, SeeUPO uses a turn-oriented approach that groups samples from the same turn into separate batches. The figure contrasts the batch construction patterns of SeeUPO and the Vanilla approach for two tasks with at most three turns of interaction, under the ReAct + Reasoning-Augmented Template paradigm [zhai2025agentevolver].
  • Figure 4: Training success rate comparison of SeeUPO and baselines. Subplots (a)-(d) show results for Qwen-3 on AppWorld, Qwen-3 on BFCL, Qwen-2.5 on AppWorld, and Qwen-2.5 on BFCL, respectively.
  • Figure 5: Training dynamics comparison of different update order strategies: (a) Qwen-3 evaluated on AppWorld and (b) Qwen-3 evaluated on BFCL-v4.
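
The turn-oriented batching of Figure 3 and the reverse update order of Figure 2 can be sketched together as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the trajectory layout (a mapping from task id to a list of per-turn samples) and the helper names `group_by_turn` and `reverse_turn_order` are hypothetical, and the actual policy update applied to each batch is omitted.

```python
from collections import defaultdict


def group_by_turn(trajectories):
    """Collect samples that share a turn index into the same batch."""
    batches = defaultdict(list)
    for task_id, turns in trajectories.items():
        for turn_index, sample in enumerate(turns):
            batches[turn_index].append((task_id, sample))
    return dict(batches)


def reverse_turn_order(batches):
    """Yield (turn_index, batch) pairs from the last turn back to the first."""
    for turn_index in sorted(batches, reverse=True):
        yield turn_index, batches[turn_index]


# Hypothetical data: two tasks with at most three turns each.
trajectories = {
    "task_a": ["a0", "a1", "a2"],
    "task_b": ["b0", "b1"],
}

batches = group_by_turn(trajectories)
update_order = [turn for turn, _ in reverse_turn_order(batches)]
print(update_order)  # turn indices in the order they would be updated
```

Tasks that end early simply contribute no sample to the later turn batches, so the batch for the final turn may be smaller than the batch for the first turn.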

Theorems & Definitions (36)

  • theorem 1
  • definition 1
  • theorem 2
  • proof
  • theorem 3
  • proof
  • theorem 4
  • lemma 1
  • proof
  • proof
  • ...and 26 more