HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, Mingyi Hong

TL;DR

HiPER introduces a two-level hierarchical RL framework for large language model agents that explicitly separates high-level planning from low-level execution via a Plan-Execute interface. It couples this structure with Hierarchical Advantage Estimation (HAE), a two-time-scale gradient estimator that provides unbiased, low-variance credit assignment across subgoal decisions and primitive actions. The method uses a PPO-style actor-critic objective with a shared backbone and two value heads, enabling joint optimization of planning and execution. Empirically, HiPER achieves state-of-the-art results on interactive benchmarks ALFWorld and WebShop with Qwen backbones, demonstrating faster convergence, stronger long-horizon performance, and robust subgoal-driven behavior. This work highlights the importance of explicit temporal abstraction for scalable RL of multi-turn LLM agents in sparse-reward environments.
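For readers unfamiliar with what "PPO-style" means here, the following is a minimal sketch of the standard PPO clipped surrogate term for a single decision. It is the generic textbook form, not HiPER's exact objective: the function name `ppo_clip_term` and the default `clip_eps=0.2` are assumptions for illustration. How the planning-level and execution-level advantages and the two value-head losses are combined is not specified in this summary.

```python
import math

def ppo_clip_term(logp_new: float, logp_old: float,
                  advantage: float, clip_eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate for one sampled decision (to be maximized).

    logp_new / logp_old are log-probabilities of the sampled decision under the
    current and the data-collecting policy. clip_eps is an assumed default.
    """
    ratio = math.exp(logp_new - logp_old)                       # importance ratio
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))   # clip to [1-eps, 1+eps]
    # Taking the minimum removes the incentive to move the ratio far outside the clip range.
    return min(ratio * advantage, clipped * advantage)

# Identical policies: the term equals the advantage itself.
same_policy = ppo_clip_term(-1.0, -1.0, advantage=0.7)
# Ratio of 2 with a positive advantage is capped at (1 + clip_eps) * advantage.
capped = ppo_clip_term(math.log(2.0), 0.0, advantage=1.0)
# Ratio of 2 with a negative advantage is not capped (the pessimistic branch wins).
uncapped = ppo_clip_term(math.log(2.0), 0.0, advantage=-1.0)
```

In a two-level setup, one plausible use is to evaluate this term once per subgoal decision with a planning-level advantage and once per primitive action with an execution-level advantage; that pairing is an assumption here, not a statement of the paper's implementation.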

Abstract

Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse and delayed rewards, where agents must execute extended sequences of actions before receiving meaningful feedback. Most existing reinforcement learning (RL) approaches model LLM agents as flat policies operating at a single time scale, selecting one action at each turn. In sparse-reward settings, such flat policies must propagate credit across the entire trajectory without explicit temporal abstraction, which often leads to unstable optimization and inefficient credit assignment. We propose HiPER, a novel Hierarchical Plan-Execute RL framework that explicitly separates high-level planning from low-level execution. HiPER factorizes the policy into a high-level planner that proposes subgoals and a low-level executor that carries them out over multiple action steps. To align optimization with this structure, we introduce a key technique called hierarchical advantage estimation (HAE), which carefully assigns credit at both the planning and execution levels. By aggregating returns over the execution of each subgoal and coordinating updates across the two levels, HAE provides an unbiased gradient estimator and provably reduces variance compared to flat generalized advantage estimation. Empirically, HiPER achieves state-of-the-art performance on challenging interactive benchmarks, reaching 97.4% success on ALFWorld and 83.3% on WebShop with Qwen2.5-7B-Instruct (+6.6% and +8.3% over the best prior method), with especially large gains on long-horizon tasks requiring multiple dependent subtasks. These results highlight the importance of explicit hierarchical decomposition for scalable RL training of multi-turn LLM agents.
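As a rough illustration of the planner/executor split and of "aggregating returns over the execution of each subgoal", the sketch below runs a toy rollout and then sums discounted rewards within each subgoal segment. Everything named here (`rollout`, `segment_returns`, the stub policies, the toy observations and rewards, and restarting the discount at each segment) is a hypothetical stand-in for LLM calls and a real environment. It shows only the bookkeeping, not the paper's HAE estimator.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[int, str, str, float]  # (switch flag, subgoal, action, reward)

def rollout(observations: List[str], rewards: List[float],
            should_switch: Callable[[str, str], bool],
            propose_subgoal: Callable[[str], str],
            act: Callable[[str, str], str]) -> List[Step]:
    """Toy Plan-Execute loop: the planner runs only when a switch is triggered."""
    steps: List[Step] = []
    subgoal: Optional[str] = None
    for obs, reward in zip(observations, rewards):
        # A subgoal must be chosen on the first step; later it may be kept.
        switch = 1 if subgoal is None or should_switch(obs, subgoal) else 0
        if switch:
            subgoal = propose_subgoal(obs)   # high-level decision
        action = act(obs, subgoal)           # low-level decision
        steps.append((switch, subgoal, action, reward))
    return steps

def segment_returns(steps: List[Step], gamma: float) -> List[Tuple[str, float, int]]:
    """Discounted reward summed within each subgoal segment (assumed bookkeeping)."""
    segments: List[list] = []
    for switch, subgoal, _action, reward in steps:
        if switch:
            segments.append([subgoal, 0.0, 0])  # [subgoal, return, length]
        seg = segments[-1]
        seg[1] += (gamma ** seg[2]) * reward
        seg[2] += 1
    return [(g, ret, n) for g, ret, n in segments]

# Hypothetical stub policies and a sparse terminal reward.
steps = rollout(
    observations=["start", "holding cup", "cup clean", "at cabinet"],
    rewards=[0.0, 0.0, 0.0, 1.0],
    should_switch=lambda obs, goal: obs == "cup clean",
    propose_subgoal=lambda obs: "store cup" if obs == "cup clean" else "clean cup",
    act=lambda obs, goal: f"work on: {goal}",
)
segments = segment_returns(steps, gamma=0.5)
print(segments)  # [('clean cup', 0.0, 2), ('store cup', 0.5, 2)]
```

The point of the segment view is that the planner makes two decisions in this episode while the executor makes four, so credit for the final reward can be attributed at both granularities.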

Paper Structure (36 sections, 6 theorems, 66 equations, 8 figures, 4 tables, 1 algorithm)


Key Result

Theorem 4.1

Assume the Plan-Execute policy is given by the conditionals $\pi_\theta(q_t \mid s_t,o_{t-1})$, $\pi_\theta(o_t \mid s_t)$ (invoked only when $q_t=1$), and $\pi_\theta(a_t \mid s_t,o_t)$. Then the gradient of the objective in eq:hrl_obj is [...], where the advantages are defined by [...]. Here $G_t:=\sum_{t'=t}^{T-1}\gamma^{t'-t}r_{t'}$ is the return-to-go, $T$ denotes the total number of environment steps, and expectations are t...
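The return-to-go $G_t$ defined in the theorem statement can be computed in one backward pass using the recursion $G_t = r_t + \gamma G_{t+1}$ with $G_T = 0$. The sketch below is a minimal illustration of that definition only. The function name `returns_to_go` and the toy reward sequence are assumptions, and the advantage definitions themselves are not reproduced here.

```python
from typing import List

def returns_to_go(rewards: List[float], gamma: float) -> List[float]:
    """G_t = sum_{t'=t}^{T-1} gamma^(t'-t) * r_{t'} for every t in 0..T-1."""
    out = [0.0] * len(rewards)
    running = 0.0  # plays the role of G_{t+1}, with G_T = 0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# Sparse terminal reward: only the last step is rewarded.
g = returns_to_go([0.0, 0.0, 1.0], gamma=0.5)
print(g)  # [0.25, 0.5, 1.0]
```

With a single terminal reward, every earlier step's return-to-go is just a discounted copy of the same signal. This is the sparse-reward setting in which the abstract argues that flat credit assignment becomes inefficient.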

Figures (8)

  • Figure 1: Overall performance on agentic benchmarks ALFWorld and WebShop, with Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct as base models. Our method consistently outperforms all evaluated baseline methods, including the best known prior method GiGPO (feng2025group), across both benchmarks and model sizes.
  • Figure 2: Overview of the HiPER framework. The upper panel illustrates standard flat RL for LLM agents, where a single policy operates at one time scale and chooses an action at every turn, often leading to brittle long-horizon behavior. For instance, the agent may prematurely head to the cabinet before picking up and cleaning the cup. The lower panel presents our HiPER framework, built on two components: the Plan-Execute interface (Sec. \ref{sec:interface}), a structured agent interface that explicitly separates high-level planning and low-level execution; and hierarchical advantage estimation (Sec. \ref{sec:hae}), which aligns credit assignment with this two-level structure by propagating learning signals both within and across subgoal segments.
  • Figure 3: ALFWorld 7B Curves. On the validation curve, HiPER achieves a roughly 2.8$\times$ speedup relative to PPO/GRPO. On the training curve, HiPER exhibits more stable training dynamics than PPO/GRPO, with smaller oscillations.
  • Figure 4: HiPER Switching Behavior on ALFWorld. The switching frequency increases during early training, indicating a high-level exploration phase. After initial exploration, the switching frequency and segment length stabilize.
  • Figure 5: ReAct prompt template of ALFWorld agents.
  • ...and 3 more figures

Theorems & Definitions (9)

  • Theorem 4.1: Plan-Execute Gradient
  • Theorem 4.2
  • Theorem 4.3: Informal
  • Theorem 1.1
  • proof
  • Theorem 1.2
  • proof
  • Theorem 1.3
  • proof