Bidirectional-Reachable Hierarchical Reinforcement Learning with Mutually Responsive Policies

Yu Luo, Fuchun Sun, Tianying Ji, Xianyuan Zhan

TL;DR

Subgoal-based HRL often suffers from unilateral reachability between high- and low-level policies, causing inefficiencies and local optima. The BrHPO framework introduces a mutual response mechanism for bidirectional subgoal reachability, built on a joint value function and a performance-difference bound that motivates coordinated updates. High-level optimization is regularized by $\mathcal{R}^{\pi_h,\pi_l}_i$ and low-level optimization uses a surrogate reward $\hat{r}_l = r_l - \lambda_2 \mathcal{R}^{\pi_h,\pi_l}_i$, enabling cross-level error correction with modest computation. On six long-horizon tasks, BrHPO outperforms state-of-the-art HRL baselines and maintains training efficiency close to flat SAC, illustrating improved exploration and robustness in sparse and dense reward settings.
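As a rough illustration of how the mutual response terms could enter training, here is a minimal sketch, assuming the reachability penalty $\mathcal{R}^{\pi_h,\pi_l}_i$ is computed as a normalized state-to-subgoal distance over one high-level step; the names `reachability_penalty`, `phi`, and `lam2` are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def reachability_penalty(states, subgoal, phi=lambda s: np.asarray(s)):
    """Illustrative reachability penalty R_i for one high-level step.

    Assumption (not taken from the paper's text): the penalty is a normalized
    distance between the states visited while pursuing `subgoal`, mapped into
    subgoal space by `phi`, and the subgoal itself.
    """
    subgoal = np.asarray(subgoal)
    dists = np.array([np.linalg.norm(phi(s) - subgoal) for s in states])
    # Normalize by the initial distance so the penalty is scale-free.
    return float(dists.mean() / (dists[0] + 1e-8))

def surrogate_low_level_reward(r_l, penalty, lam2=0.1):
    """Shaped low-level reward stated in the TL;DR: r_hat_l = r_l - lambda_2 * R_i.

    The same penalty also regularizes the high-level objective; its exact form
    there is given in the paper and not reproduced here.
    """
    return r_l - lam2 * penalty
```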

Abstract

Hierarchical reinforcement learning (HRL) addresses complex long-horizon tasks by skillfully decomposing them into subgoals. Therefore, the effectiveness of HRL is greatly influenced by subgoal reachability. Typical HRL methods only consider subgoal reachability from the unilateral level, where a dominant level enforces compliance to the subordinate level. However, we observe that when the dominant level becomes trapped in local exploration or generates unattainable subgoals, the subordinate level is negatively affected and cannot follow the dominant level's actions. This can potentially make both levels stuck in local optima, ultimately hindering subsequent subgoal reachability. Allowing real-time bilateral information sharing and error correction would be a natural cure for this issue, which motivates us to propose a mutual response mechanism. Based on this, we propose the Bidirectional-reachable Hierarchical Policy Optimization (BrHPO)--a simple yet effective algorithm that also enjoys computation efficiency. Experiment results on a variety of long-horizon tasks showcase that BrHPO outperforms other state-of-the-art HRL baselines, coupled with a significantly higher exploration efficiency and robustness.

Paper Structure

This paper contains 44 sections, 5 theorems, 59 equations, 13 figures, 4 tables, and 1 algorithm.

Key Result

Theorem 3.3

The performance difference bound $C$ between the induced optimal hierarchical policy $\Pi^*$ and the learned one $\Pi$ can be expressed in terms of $\epsilon^g_{\pi^*_l,\pi_l}$, the distribution shift between $\pi^*_l$ and $\pi_l$, and $\mathcal{R}^{\pi_h,\pi_l}_{max}$, the maximum subgoal reachability penalty under the learned policy $\Pi$.
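
To make the structure of the statement explicit, the following is a hedged sketch of the bound's general shape; $c_1$ and $c_2$ stand in for the horizon- and discount-dependent coefficients derived in the paper and are not reproduced here.

```latex
% Structural sketch only: c_1 and c_2 abbreviate the coefficients from the paper.
J(\Pi^*) - J(\Pi) \le C,
\qquad
C = c_1\,\epsilon^{g}_{\pi^*_l,\pi_l} + c_2\,\mathcal{R}^{\pi_h,\pi_l}_{\max}.
```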

Figures (13)

  • Figure 1: A motivating example of our proposed BrHPO. The earth, brain, and robot symbols stand for the environment, high-level policy, and low-level policy, respectively. We illustrate the behaviors of the hierarchical policies before and after being updated for each case. Left: The updated subgoal is limited by low-level exploration. Middle: The low-level policy struggles to approach the fixed subgoal. Right: The hierarchical policies are mutually responsive for subgoal reachability.
  • Figure 2: The state-subgoal trajectory comparison of baselines HIRO (a), RIS (b) and our BrHPO (c). We visualize the state trajectories (represented by the red-to-blue gradient lines) and the guided subgoals (represented by triangles). Note that lines and triangles of the same color indicate that they belong to the same subtask. The results demonstrate that BrHPO can improve the alignment between states and subgoals, thus benefiting overall performance.
  • Figure 3: Environments used in our experiments. In maze tasks, the red square indicates the start point and the blue square represents the target point. In manipulation tasks, a robotic arm aims to move its end-effector and a (puck-shaped) grey object to their respective target positions, each marked as a red ball.
  • Figure 4: The average success rate in various continuous control tasks of BrHPO and baselines. The solid lines are the average success rate, while the shades indicate the standard error of the average performance. All algorithms are evaluated with $5$ random seeds.
  • Figure 5: The performance and state-subgoal trajectory visualization from different BrHPO variants.
  • ...and 8 more figures

Theorems & Definitions (11)

  • Definition 3.1
  • Definition 3.2: Joint Value Function of Hierarchical Policies
  • Theorem 3.3: Sub-optimal performance difference bound of HRL
  • Theorem A.1: Sub-optimal performance difference bound of HRL
  • Proof
  • Proposition A.2: Equivalence between $\pi^*$ and $\Pi^*$
  • Proof
  • Lemma A.3: Bellman Backup in HRL
  • Proof
  • Lemma A.4: Markov chain TVD bound, time-varying
  • ...and 1 more