Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks

Shuo He; Lang Feng; Qi Wei; Xin Cheng; Lei Feng; Bo An

TL;DR

HGPO assigns each step to multiple hierarchical groups according to the consistency of historical contexts within a group of rollout trajectories, and can achieve a favorable bias-variance trade-off in stepwise advantage estimation, without extra models or rollouts.

Abstract

Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks. To enable more fine-grained policy updates, recent research has increasingly shifted toward stepwise group-based policy optimization, which treats each step in a rollout trajectory independently while using a memory module to retain historical context. However, we identify a key issue in estimating stepwise relative advantages, namely context inconsistency: steps within the same group may differ in their historical contexts. Empirically, we show that this issue can lead to severely biased advantage estimation and thereby significantly degrade policy optimization. To address this issue, we propose Hierarchy-of-Groups Policy Optimization (HGPO) for long-horizon agentic tasks. Specifically, within a group of rollout trajectories, HGPO assigns each step to multiple hierarchical groups according to the consistency of historical contexts. Then, for each step, HGPO computes a distinct advantage within each group and aggregates these advantages with an adaptive weighting scheme. In this way, HGPO can achieve a favorable bias-variance trade-off in stepwise advantage estimation, without extra models or rollouts. Evaluations on two challenging agentic tasks, ALFWorld and WebShop, with Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct show that HGPO significantly outperforms existing agentic RL methods under the same computational constraints. Code is available at https://github.com/langfengQ/verl-agent/tree/master/recipe/hgpo.
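The grouping-and-aggregation idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `hierarchical_advantages`, the rule that the `k`-context group collects steps sharing the current state and the `k` preceding states, the mean-centered advantage, and the weight `(k + 1) ** alpha` are all hypothetical choices made for this sketch. The abstract only states that groups follow the consistency of historical contexts and that the weighting is adaptive.

```python
from collections import defaultdict


def hierarchical_advantages(trajectories, K=2, alpha=1.0):
    """Sketch of stepwise advantages aggregated over hierarchical groups.

    trajectories: list of trajectories, each a list of (state, ret) pairs,
        where `state` is hashable and `ret` is a scalar return for that step.
    K: largest history length used for grouping (assumption).
    alpha: exponent of the hypothetical weight (k + 1) ** alpha.
    Returns a list of lists of advantages aligned with `trajectories`.
    """
    # groups[k][key] holds the returns of all steps whose current state and
    # k preceding states equal `key` (assumed definition of a k-context group).
    groups = [defaultdict(list) for _ in range(K + 1)]
    for traj in trajectories:
        states = [s for s, _ in traj]
        for t, (_, ret) in enumerate(traj):
            # A step at position t has at most t preceding states.
            for k in range(min(K, t) + 1):
                key = tuple(states[t - k : t + 1])
                groups[k][key].append(ret)

    advantages = []
    for traj in trajectories:
        states = [s for s, _ in traj]
        row = []
        for t, (_, ret) in enumerate(traj):
            num, den = 0.0, 0.0
            for k in range(min(K, t) + 1):
                members = groups[k][tuple(states[t - k : t + 1])]
                baseline = sum(members) / len(members)
                weight = (k + 1) ** alpha  # hypothetical weighting scheme
                num += weight * (ret - baseline)  # mean-centered advantage
                den += weight
            row.append(num / den)
        advantages.append(row)
    return advantages


# Two rollouts that both reach "s2", but through different histories.
demo = [
    [("s0", 1.0), ("s1", 1.0), ("s2", 1.0)],
    [("s0", 0.0), ("s3", 0.0), ("s2", 0.0)],
]
demo_adv = hierarchical_advantages(demo, K=1, alpha=1.0)
```

In the demo, the two visits to `"s2"` fall into the same 0-context group but into different 1-context groups, so the pooled 0-context advantage is shrunk by the more context-consistent group. Singleton groups contribute a zero advantage here; a real implementation might instead skip them.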

Paper Structure (26 sections, 1 theorem, 19 equations, 9 figures, 7 tables, 1 algorithm)

Introduction
Related work
Preliminaries
Training Agents with HGPO for Long-horizon Agentic Tasks
The issue of historical context inconsistency
Hierarchy-of-Groups Policy Optimization
Experiments
Experiment Setup
Experimental results
Further Analysis
Parameter analysis
Ablation study
Conclusion
Algorithm
More details and proof for Theorem
...and 11 more sections

Key Result

Proposition 4.1

Let $b_k$ and $v_k$ denote the bias and variance of the estimated advantage $A_{k}^{H}$ within the $k$-th group $G_{k}^{H}$. Suppose the following conditions hold: (1) the biases satisfy $B_T \geq b_0 \geq (b_1,\cdots, b_{K-1}) \geq b_K \geq 0$; (2) the variances satisfy $v_0 \leq (v_1, \cdots)$ [...]. Furthermore, the bias and variance of the advantage estimator in HGPO satisfy [...]
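As a hedged aside, condition (1) alone already bounds the bias of any convex combination of the per-group advantages. This is not the statement of Proposition 4.1, whose conclusion is not reproduced here. The symbols $\hat{A}$, $w_k$, and $A^{\star}$ are assumptions introduced for illustration: $\hat{A} = \sum_{k=0}^{K} w_k A_k^{H}$ with $w_k \geq 0$ and $\sum_{k} w_k = 1$, $A^{\star}$ is the target advantage, and $b_k$ is read as $|\mathbb{E}[A_k^{H}] - A^{\star}|$. By linearity of expectation and the triangle inequality:

```latex
\left| \mathbb{E}[\hat{A}] - A^{\star} \right|
  = \left| \sum_{k=0}^{K} w_k \left( \mathbb{E}[A_k^{H}] - A^{\star} \right) \right|
  \leq \sum_{k=0}^{K} w_k \, b_k
  \leq b_0 .
```

The last step uses $b_k \leq b_0$ for every $k$, which follows from the ordering in condition (1). Under these assumptions, the aggregate is no more biased than the least context-consistent group. Its variance depends on the covariances between the per-group estimates and is not bounded by this argument.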

Figures (9)

Figure 1: Figure (a) compares trajectory-wise and stepwise policy optimization frameworks. Given two example group trajectories, Figure (b) illustrates trajectory-level and step-level grouping with their corresponding advantage estimations. Best viewed in color.
Figure 2: Statistics of GRPO and GiGPO. Figures (a) and (b) present the advantage differences relative to Oracle advantages for GRPO and GiGPO, respectively. Figures (c) and (d) report the average group size and the proportion of Oracle steps, respectively.
Figure 3: Overview of HGPO. The LLM-based agent interacts with a set of environments initialized from the same state $\bm{s}_{0}$, producing four group trajectories (states with the same color are identical). HGPO comprises two key components: context-aware hierarchical grouping and adaptive weighted advantage computation. For illustration, consider the state $\bm{s}_{2}$ (purple). First, HGPO assigns $\bm{s}_{2}$ into three hierarchical groups according to its historical contexts. Then, it computes the final advantage estimate by adaptively aggregating the weighted advantages from these groups.
Figure 4: Distributions of hierarchical group sizes on ALFWorld and WebShop using Qwen2.5-1.5B-Instruct. "0/1/2/3/4-Context" indicates different hierarchical groups. The first two rows correspond to $K=2$, and the last row corresponds to $K=4$. The y-axis denotes the proportion.
Figure 5: The prompt template of ALFWorld agents.
...and 4 more figures

Theorems & Definitions (1)

Proposition 4.1: Bias-variance trade-off in HGPO
