PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling

Utsav Singh, Wesley A. Suttle, Brian M. Sadler, Vinay P. Namboodiri, Amrit Singh Bedi

Abstract

In this work, we introduce PIPER: Primitive-Informed Preference-based Hierarchical reinforcement learning via Hindsight Relabeling, a novel approach that uses preference-based learning to learn a reward model and then uses this reward model to relabel higher-level replay buffers. Because this reward is unaffected by the behavior of the lower-level primitive, our relabeling-based approach mitigates the non-stationarity that is common in existing hierarchical approaches, and it demonstrates strong performance across a range of challenging sparse-reward tasks. Since obtaining human feedback is typically impractical, we replace the human-in-the-loop approach with our primitive-in-the-loop approach, which generates feedback using sparse rewards provided by the environment. Moreover, to prevent infeasible subgoal prediction and avoid degenerate solutions, we propose primitive-informed regularization, which conditions higher-level policies to generate subgoals that are feasible for lower-level policies. We perform extensive experiments to show that PIPER mitigates non-stationarity in hierarchical reinforcement learning and achieves success rates greater than 50$\%$ in challenging, sparse-reward robotic environments, where most other baselines fail to make any significant progress.
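
As a rough illustration of the primitive-in-the-loop idea, the sketch below labels a pair of trajectory segments by comparing their summed sparse environment rewards, then scores a reward model's predictions against that label. The abstract does not state the loss, so the Bradley-Terry cross-entropy used here is an assumption borrowed from common preference-based RL practice, and the names `pil_preference` and `bradley_terry_loss` are hypothetical.

```python
import math


def pil_preference(sparse_rewards_a, sparse_rewards_b):
    """Hypothetical primitive-in-the-loop label from sparse environment rewards.

    Returns 1.0 if segment A has the higher summed reward, 0.0 if segment B
    does, and 0.5 on a tie.
    """
    total_a, total_b = sum(sparse_rewards_a), sum(sparse_rewards_b)
    if total_a > total_b:
        return 1.0
    if total_b > total_a:
        return 0.0
    return 0.5


def bradley_terry_loss(pred_rewards_a, pred_rewards_b, label):
    """Cross-entropy between `label` and P(A preferred) = sigmoid(sum_a - sum_b).

    Assumption: a Bradley-Terry preference model over summed predicted rewards.
    """
    diff = sum(pred_rewards_a) - sum(pred_rewards_b)
    # Numerically stable log(sigmoid(diff)).
    if diff >= 0:
        log_p_a = -math.log1p(math.exp(-diff))
    else:
        log_p_a = diff - math.log1p(math.exp(diff))
    # log(1 - sigmoid(diff)) = log(sigmoid(diff)) - diff.
    log_p_b = log_p_a - diff
    return -(label * log_p_a + (1.0 - label) * log_p_b)
```

In a full training loop, the predicted rewards would come from the learned model and the loss would be minimized by gradient descent; this sketch only shows how a label and a loss value could be computed for one pair of segments.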

Paper Structure (25 sections, 22 equations, 11 figures, 1 algorithm)

This paper contains 25 sections, 22 equations, 11 figures, 1 algorithm.

Figures (11)

  • Figure 1: PIPER Overview. This figure shows an overview of PIPER (left). The higher-level policy predicts subgoals $g_t$ for the lower-level primitive, which executes actions $a_t$ in the environment. We propose to learn a preference-based reward model $\widehat{r}_{\phi}$ using our primitive-in-the-loop (PiL) feedback on higher-level trajectories sampled from the higher-level replay buffer, and subsequently use $\widehat{r}_{\phi}$ to relabel the replay buffer transitions, thereby mitigating non-stationarity in HRL. On the right, we depict the training environments: $(i)$ maze navigation, $(ii)$ pick and place, $(iii)$ push, $(iv)$ hollow, and $(v)$ Franka kitchen.
  • Figure 2: Success rate comparison. This figure compares success rates on four sparse-reward maze navigation and robotic manipulation environments. The solid lines and shaded regions represent the mean and standard deviation across $5$ seeds. We compare our approach, PIPER, against multiple baselines. As can be seen, PIPER significantly outperforms the baselines.
  • Figure 3: Regularization weight $\alpha$ ablation. This figure compares success rates for various values of the primitive-informed regularization weight hyper-parameter $\alpha$. If $\alpha$ is too small, we lose the advantages of primitive-informed regularization, which degrades performance. In contrast, if $\alpha$ is too large, it may lead to degenerate solutions. These plots therefore demonstrate that proper primitive-informed regularization is crucial for appropriate subgoal prediction and for improving overall performance.
  • Figure 4: Hindsight relabeling ablation. This figure compares the performance of our PIPER approach with the PIPER-No-HR ablation, which is effectively PIPER without hindsight relabeling (Andrychowicz et al., 2017), as explained in Section \ref{sec:hr}. The plots show that although hindsight relabeling yields only a minor performance improvement in the sparse maze and kitchen tasks, it provides a significant training speedup in the sparse pick and place and push environments.
  • Figure 5: Target networks ablation. This figure compares the performance of our PIPER approach with the PIPER-No-Target ablation, which is effectively PIPER without target networks (Lillicrap et al., 2015). The plots show that using target networks significantly improves performance and reduces the training instability caused by non-stationary reward models $\widehat{r}_{\phi}$, which are learned using preference-based learning.
  • ...and 6 more figures
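
The relabeling step described in the Figure 1 caption, where the learned reward model $\widehat{r}_{\phi}$ overwrites the rewards stored in the higher-level replay buffer, can be sketched roughly as follows. The transition layout, the reward model's input signature, and the distance-based stand-in `toy_reward_model` are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class HighLevelTransition:
    """Hypothetical layout of one higher-level replay buffer entry."""
    state: Sequence[float]
    goal: Sequence[float]
    subgoal: Sequence[float]   # the higher-level action
    reward: float              # reward stored at collection time
    next_state: Sequence[float]


# Assumed reward-model interface: (state, goal, subgoal) -> scalar reward.
RewardModel = Callable[[Sequence[float], Sequence[float], Sequence[float]], float]


def relabel_buffer(buffer: List[HighLevelTransition],
                   reward_model: RewardModel) -> List[HighLevelTransition]:
    """Return a copy of the buffer with rewards replaced by the model's output.

    Because the new reward depends only on (state, goal, subgoal), it does not
    change as the lower-level primitive changes during training.
    """
    return [
        replace(t, reward=reward_model(t.state, t.goal, t.subgoal))
        for t in buffer
    ]


def toy_reward_model(state, goal, subgoal):
    """Stand-in for a learned model: negative squared subgoal-to-goal distance."""
    return -sum((s - g) ** 2 for s, g in zip(subgoal, goal))
```

Returning a new list rather than mutating in place keeps the originally collected rewards available, which can be convenient when the reward model is updated and the buffer needs to be relabeled again.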