Guided Cooperation in Hierarchical Reinforcement Learning via Model-based Rollout

Haoran Wang; Zeshen Tang; Leya Yang; Yaoru Sun; Fang Wang; Siyu Zhang; Yeming Chen

TL;DR

This work addresses the challenge of inter-level coordination in goal-conditioned HRL by introducing Guided Cooperation via Model-based Rollout (GCMR). GCMR combines a forward dynamics model for model-based off-policy correction, a gradient-penalty term with a model-informed upper bound to stabilize lower-level Q-functions, and a one-step rollout-based planning mechanism that uses the higher-level critic to guide the lower level. When integrated with the ACLG baseline, GCMR achieves state-of-the-art robustness and data efficiency on a suite of long-horizon, sparse-reward tasks, with ablations showing the gradient penalty and planning component as the primary drivers of improvement, while model-based relabeling alone is less impactful due to rollout errors. The results demonstrate the value of inter-level dynamics as a communication channel for hierarchical RL and point to future extensions in online, high-dimensional, and multi-robot settings.

Abstract

Goal-conditioned hierarchical reinforcement learning (HRL) presents a promising approach for enabling effective exploration in complex, long-horizon reinforcement learning (RL) tasks through temporal abstraction. Empirically, heightened inter-level communication and coordination can induce more stable and robust policy improvement in hierarchical systems. Yet most existing goal-conditioned HRL algorithms have focused primarily on subgoal discovery, neglecting inter-level cooperation. Here, we propose a goal-conditioned HRL framework named Guided Cooperation via Model-based Rollout (GCMR), which aims to bridge inter-layer information synchronization and cooperation by exploiting forward dynamics. First, GCMR mitigates the state-transition error within off-policy correction via model-based rollout, thereby enhancing sample efficiency. Second, to prevent disruption by unseen subgoals and states, lower-level Q-function gradients are constrained using a gradient penalty with a model-inferred upper bound, leading to a more stable behavioral policy conducive to effective exploration. Third, we propose one-step rollout-based planning, which uses higher-level critics to guide the lower-level policy. Specifically, we estimate the value of future states of the lower-level policy using the higher-level critic function, thereby transmitting global task information downwards to avoid local pitfalls. These three components of GCMR are expected to facilitate inter-level cooperation significantly. Experimental results demonstrate that incorporating the proposed GCMR framework with a disentangled variant of HIGL, namely ACLG, yields more stable and robust policy improvement compared to various baselines and significantly outperforms previous state-of-the-art algorithms.
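To make the third component concrete, the following is a minimal sketch of one-step rollout-based planning as the abstract describes it: the lower-level action is pushed through a forward dynamics model, and the predicted next state is scored by the higher-level critic. The function names, signatures, and toy stand-ins (`toy_dynamics`, `toy_low_policy`, `toy_high_policy`, `toy_high_critic`) are illustrative assumptions, not the authors' implementation; in particular, how the next subgoal is chosen and how the score enters the lower-level loss are assumed here.

```python
import numpy as np

def one_step_rollout_value(state, goal, subgoal,
                           low_policy, dynamics, high_policy, high_critic):
    """Score a lower-level decision by the higher-level value of its predicted outcome."""
    action = low_policy(state, subgoal)            # lower-level action for the current subgoal
    next_state = dynamics(state, action)           # one-step prediction from the learned model
    next_subgoal = high_policy(next_state, goal)   # assumed: subgoal proposed at the predicted state
    return high_critic(next_state, goal, next_subgoal)

# Hypothetical toy stand-ins so the sketch runs end to end.
def toy_dynamics(state, action):
    return state + action                          # additive point-mass dynamics

def toy_low_policy(state, subgoal):
    direction = subgoal - state
    norm = np.linalg.norm(direction)
    return direction / norm if norm > 0 else np.zeros_like(state)  # unit step toward subgoal

def toy_high_policy(state, goal):
    return goal                                    # propose the final goal as the subgoal

def toy_high_critic(state, goal, subgoal):
    return -float(np.linalg.norm(goal - state))    # closer to the goal means higher value

state = np.array([0.0, 0.0])
goal = np.array([4.0, 0.0])
fns = (toy_low_policy, toy_dynamics, toy_high_policy, toy_high_critic)

toward = one_step_rollout_value(state, goal, np.array([2.0, 0.0]), *fns)
away = one_step_rollout_value(state, goal, np.array([-2.0, 0.0]), *fns)
```

In this toy setting, a subgoal that moves the agent toward the final goal receives a higher score than one that moves it away, which is the sense in which global task information flows down to the lower level. A training loop would presumably maximize this score alongside the usual lower-level objective, but that weighting is not specified here.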

Paper Structure (46 sections, 3 theorems, 28 equations, 18 figures, 4 tables, 1 algorithm)


Introduction
Preliminaries
Parameterized Rewards
Experience Replay for Off-Policy Learning
Adjacency Constraint
Landmark-based Planning
Disentangled Variant of HIGL (Kim et al., 2021): Adjacency Constraint and Landmark-Guided Planning (ACLG)
Related work
Transition Relabeling
Model Exploitation in Goal-conditioned HRL
Methods
Forward Dynamics Modeling
Off-Policy Correction via Model-based Rollouts
Gradient Penalty with a Model-Inferred Upper Bound
One-Step Rollout-based Planning
...and 31 more sections

Key Result

Proposition 1

Let $\pi^*(a_t|s_t)$ and $r^*(s_t,a_t)$ be the policy and the reward function in an MDP. Suppose the Frobenius norms of the policy and reward gradients w.r.t. input actions are upper-bounded, i.e., $\Vert \frac{\partial \pi^*(a_{t+1}|s_{t+1})}{\partial a_t}\Vert_F \leq L_{\pi} < 1$ and [...], where $N$ denotes the dimension of the action and $\gamma$ is the discount factor.
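As an illustration of how an upper bound on the Q-function gradient could be used as a penalty, here is a minimal sketch assuming a squared-hinge form: the gradient of Q with respect to the action is estimated, and only the part of its norm that exceeds the bound is penalized. The hinge form, the finite-difference estimate, and the names `action_gradient`, `gradient_penalty`, and `toy_q` are assumptions for illustration; the bound is passed in as a plain number rather than computed from the proposition.

```python
import numpy as np

def action_gradient(q_fn, state, action, eps=1e-5):
    """Central finite-difference estimate of dQ/da (stand-in for autodiff)."""
    action = np.asarray(action, dtype=float)
    grad = np.zeros_like(action)
    for i in range(action.size):
        step = np.zeros_like(action)
        step[i] = eps
        grad[i] = (q_fn(state, action + step) - q_fn(state, action - step)) / (2 * eps)
    return grad

def gradient_penalty(q_fn, state, action, upper_bound):
    """Assumed squared-hinge penalty: zero while the gradient norm stays within the bound."""
    grad_norm = np.linalg.norm(action_gradient(q_fn, state, action))
    return max(0.0, grad_norm - upper_bound) ** 2

# Hypothetical Q-function whose action gradient is [3, 0], so its norm is 3.
def toy_q(state, action):
    return 3.0 * action[0] - float(np.sum(state))

state = np.array([1.0, 2.0])
action = np.array([0.5, -0.5])

penalty_tight = gradient_penalty(toy_q, state, action, upper_bound=2.0)  # norm exceeds bound
penalty_loose = gradient_penalty(toy_q, state, action, upper_bound=5.0)  # norm within bound
```

With the tight bound the penalty is approximately (3 - 2)^2 = 1, and with the loose bound it vanishes. In practice the gradient would come from automatic differentiation of the critic network rather than finite differences.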

Figures (18)

Figure 1: Illustrations of HIGL (Kim et al., 2021) and of adjacency constraint and landmark-guided planning (ACLG). In HIGL, coverage- and novelty-based landmarks are selected to form a map, from which the most urgent landmark is chosen as the next expected subgoal. Meanwhile, to ensure the reachability of subgoals, HIGL introduces the adjacency constraint. However, in HIGL, the entanglement between the adjacency constraint and landmark-based planning only compels the subgoals to move towards the selected landmark, without guaranteeing $c$-step adjacency with the current state. ACLG decouples the two to provide a better balance between adjacency and landmark-based planning.
Figure 2: One-step rollout-based planning endeavors to utilize global information to direct the behavioral policy. Our method steers lower-level policy towards valuable highlands with respect to the final goal (see red trajectory), surpassing the performance of general HRL (see blue trajectory).
Figure 3: Environments used in our experiments. In the maze-related tasks, the goal in each task is marked with a red arrow, and the black line represents a possible trajectory from the current state to the goal.
Figure 4: Ablation studies on landmark-related components. We measure the performance of ACLG by (a) varying number of landmarks and (b) varying balancing coefficient $\lambda^{\rm ACLG}_{\rm landmark}$ in Ant Maze (U-shape).
Figure 5: Impact of shift magnitude $\delta_{sg}$ on the performance of ACLG+GCMR in Large Ant Maze (U-shape). The panels illustrate the original subgoals and the relabeled subgoals without a shift or with varying shift magnitude. A further panel plots the learning curves of ACLG+GCMR on the Large Ant Maze (U-shape) with varying shift magnitude $\delta_{sg}$; the result is averaged over five random seeds.
...and 13 more figures

Theorems & Definitions (8)

Proposition 1
proof
Remark 1
Lemma 1
proof
Remark 2
Proposition 2
proof
