PEAR: Primitive Enabled Adaptive Relabeling for Boosting Hierarchical Reinforcement Learning

Utsav Singh; Vinay P. Namboodiri

TL;DR

PEAR introduces primitive enabled adaptive relabeling to address non-stationarity in off-policy HRL by generating a curriculum of achievable subgoals from a small set of expert demonstrations. It then jointly optimizes higher-level subgoal policies and lower-level primitives using RL with imitation learning regularization on a dynamically refreshed subgoal dataset $D_g$, yielding two concrete variants: PEAR-BC and PEAR-IRL. Theoretical sub-optimality bounds show how adaptive relabeling and IL regularization tighten performance guarantees, while empirical results across six MuJoCo tasks and real-world robot experiments show substantial improvements over baselines, including success rates of up to 80% on sparse long-horizon tasks. PEAR is designed to be compatible with standard off-policy algorithms and requires only minimal task-structure assumptions, which makes it a practical approach to long-horizon HRL problems.

Abstract

Hierarchical reinforcement learning (HRL) has the potential to solve complex long-horizon tasks using temporal abstraction and increased exploration. However, hierarchical agents are difficult to train due to inherent non-stationarity. We present primitive enabled adaptive relabeling (PEAR), a two-phase approach where we first perform adaptive relabeling on a few expert demonstrations to generate efficient subgoal supervision, and then jointly optimize HRL agents by employing reinforcement learning (RL) and imitation learning (IL). We perform theoretical analysis to bound the sub-optimality of our approach and derive a joint optimization framework using RL and IL. Since PEAR utilizes only a few expert demonstrations and makes minimal limiting assumptions on the task structure, it can be easily integrated with typical off-policy RL algorithms to produce a practical HRL approach. We perform extensive experiments on challenging environments and show that PEAR outperforms various hierarchical and non-hierarchical baselines and achieves up to $80\%$ success rates in complex sparse robotic control tasks where other baselines typically fail to show significant progress. We also perform ablations to thoroughly analyse the importance of our various design choices. Finally, we perform real-world robotic experiments on complex tasks and demonstrate that PEAR consistently outperforms the baselines.
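
The joint use of RL and IL described in the abstract can be illustrated with a minimal sketch of a behavior-cloning-style regularizer on the higher-level policy. The squared-error form, the function name `bc_regularized_loss`, and the weight `reg_weight` are assumptions made for illustration; the paper's exact objective may differ.

```python
from typing import Sequence


def bc_regularized_loss(
    rl_loss: float,
    predicted_subgoals: Sequence[Sequence[float]],
    demo_subgoals: Sequence[Sequence[float]],
    reg_weight: float,
) -> float:
    """Add a behavior-cloning penalty to an RL loss (illustrative form only).

    predicted_subgoals: subgoals output by the higher-level policy for states in D_g.
    demo_subgoals: the relabeled subgoals stored in D_g for the same states.
    reg_weight: hypothetical trade-off weight between the RL and IL terms.
    """
    if len(predicted_subgoals) != len(demo_subgoals):
        raise ValueError("prediction and demonstration batches must match in size")
    if not predicted_subgoals:
        return rl_loss  # no supervision available, fall back to pure RL

    # Squared Euclidean distance per sample, averaged over the batch.
    sq_errors = [
        sum((p - d) ** 2 for p, d in zip(pred, demo))
        for pred, demo in zip(predicted_subgoals, demo_subgoals)
    ]
    bc_term = sum(sq_errors) / len(sq_errors)
    return rl_loss + reg_weight * bc_term
```

Setting `reg_weight` to zero recovers the unregularized RL loss, which makes the contribution of the imitation term easy to ablate in this sketch.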

Paper Structure (28 sections, 1 theorem, 24 equations, 21 figures, 2 algorithms)

Introduction
Related Work
Background
Methodology
Primitive Enabled Adaptive Relabeling
Joint optimization
Sub-optimality analysis
Experiments
Evaluation and Results
Ablative analysis
Discussion
Appendix
Sub-optimality analysis
Sub-optimality proof for higher level policy
Sub-optimality proof for lower level policy
...and 13 more sections

Key Result

Theorem 1

Assuming the optimal policy $\pi^{*}$ is $\phi_D$-common in $\Pi_{D}^{H}$, the sub-optimality of the higher-level policy $\pi_{\theta_{H}}^{H}$ over $c$-length sub-trajectories $\tau$ sampled from $d_{c}^{\pi^{*}}$ can be bounded by an expression that depends on the constant $\lambda_{H}=\frac{2}{(1-\gamma)(1-\gamma^{c})}R_{\max} \left\| \frac{d_c^{\pi^{*}}}{\kappa} \right\|_{\infty}$.
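
To get a feel for how the constant $\lambda_{H}$ in Theorem 1 scales, the following sketch evaluates its closed form for given inputs. The function name `lambda_h` and the parameter `density_ratio_sup`, which stands for the value of $\| d_c^{\pi^{*}} / \kappa \|_{\infty}$, are hypothetical names introduced here; the formula itself is the one stated in the theorem.

```python
def lambda_h(gamma: float, c: int, r_max: float, density_ratio_sup: float) -> float:
    """Evaluate lambda_H = 2 / ((1 - gamma) * (1 - gamma**c)) * R_max * ||d_c / kappa||_inf.

    gamma: discount factor, must lie in [0, 1).
    c: sub-trajectory length (number of lower-level steps per subgoal), must be >= 1.
    r_max: maximum reward magnitude R_max.
    density_ratio_sup: assumed value of the sup-norm of d_c^{pi*} / kappa.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    if c < 1:
        raise ValueError("c must be at least 1")
    return 2.0 / ((1.0 - gamma) * (1.0 - gamma ** c)) * r_max * density_ratio_sup


# Example: gamma = 0.5, c = 2, R_max = 1, density ratio = 2 gives 32/3.
example = lambda_h(0.5, 2, 1.0, 2.0)
```

For fixed $R_{\max}$ and density ratio, the constant grows as $\gamma$ approaches 1 and shrinks as the sub-trajectory length $c$ increases, since $1-\gamma^{c}$ grows with $c$.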

Figures (21)

Figure 1: Adaptive Relabeling Overview: We segment expert demonstrations by consecutively passing demonstration states as subgoals to the lower primitive and finding the first state $s_i$ where $Q_{\pi^{L}}(s,s_i,a_i)<Q_{thresh}$ (here $s_i=s_4$). Since $s_3$ was the last reachable subgoal, it is selected as the subgoal for initial state $s_0$. The transition is added to $D_g$, and the process continues with $s_3$ as the new initial state.
Figure 2: Subgoal evolution: With training, as the lower primitive improves, the higher-level subgoal predictions (blue spheres) become better and harder while always remaining achievable by the lower primitive. Row 1 depicts initial training, Row 2 depicts mid-way through training, and Row 3 depicts the end of training. This generates a curriculum of achievable subgoals for the lower primitive (red spheres: final goal).
Figure 3: Success rate comparison. This figure compares success rates on six sparse maze navigation and manipulation tasks. The solid line and shaded region represent the mean and range of success rates across 5 seeds. PEAR significantly outperforms the baselines.
Figure 4: Non-stationarity metric comparison. This figure compares the average distance between the subgoals predicted by the higher-level policy and the subgoals achieved by the lower-level policy during training. PEAR consistently produces efficient subgoals, leading to low distances between the predicted and achieved subgoals throughout training. This mitigates non-stationarity in HRL.
Figure 5: Success rate comparison between PEAR-IRL (red), PEAR-BC (black), and the PEAR-RPL ablation (blue). PEAR-IRL and PEAR-BC clearly outperform PEAR-RPL in almost all the tasks.
...and 16 more figures
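
The segmentation procedure described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `q_fn` stands in for the lower primitive's critic $Q_{\pi^{L}}$, the action passed to it is the demonstration action at the current initial state, and falling back to the immediate next state when no candidate passes the threshold is an added safeguard so the loop always advances.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


def adaptive_relabel(
    states: Sequence[State],
    actions: Sequence[Action],
    q_fn: Callable[[State, State, Action], float],
    q_thresh: float,
) -> List[Tuple[State, State]]:
    """Segment one demonstration into (initial state, subgoal) pairs for D_g.

    states: demonstration states s_0 .. s_T.
    actions: demonstration actions a_0 .. a_{T-1}.
    q_fn: hypothetical critic, called as q_fn(state, candidate_subgoal, action).
    q_thresh: reachability threshold Q_thresh.
    """
    d_g: List[Tuple[State, State]] = []
    start, last = 0, len(states) - 1
    while start < last:
        subgoal = start + 1  # assumed fallback so progress is guaranteed
        for i in range(start + 1, last + 1):
            # Stop at the first candidate the lower primitive cannot reach.
            if q_fn(states[start], states[i], actions[start]) < q_thresh:
                break
            subgoal = i  # last candidate still judged reachable
        d_g.append((states[start], states[subgoal]))
        start = subgoal  # continue from the selected subgoal
    return d_g


# Toy 1-D demonstration with a distance-based stand-in critic.
toy_states = [(float(k),) for k in range(7)]
toy_actions = [(1.0,)] * 6
toy_q = lambda s, g, a: -abs(g[0] - s[0])
toy_d_g = adaptive_relabel(toy_states, toy_actions, toy_q, q_thresh=-3.5)
```

In the toy run, the first candidate to fall below the threshold from $s_0$ is $s_4$, so $s_3$ is selected, mirroring the example in the caption. As the critic improves during training, re-running the procedure would select subgoals further along the demonstration, which is what refreshes $D_g$.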

Theorems & Definitions (3)

Theorem 1
proof
proof
