Why Does Hierarchy (Sometimes) Work So Well in Reinforcement Learning?

Ofir Nachum, Haoran Tang, Xingyu Lu, Shixiang Gu, Honglak Lee, Sergey Levine

TL;DR

This paper investigates why hierarchy helps in reinforcement learning by empirically dissecting claimed benefits across locomotion, navigation, and manipulation tasks. It finds that improved exploration—not necessarily temporally extended actions or semantic policy structure—accounts for most HRL gains, and proposes hierarchy-inspired, non-hierarchical exploration methods that achieve competitive performance. The study suggests a shift in focus from designing complex HRL architectures to developing temporally extended, semantically meaningful exploration strategies. These insights offer practical guidance for enhancing exploration in RL and highlight directions for realizing additional benefits of hierarchical structures in future work.

Abstract

Hierarchical reinforcement learning has demonstrated significant success at solving difficult reinforcement learning (RL) tasks. Previous works have motivated the use of hierarchy by appealing to a number of intuitive benefits, including learning over temporally extended transitions, exploring over temporally extended periods, and training and exploring in a more semantically meaningful action space, among others. However, in fully observed, Markovian settings, it is not immediately clear why hierarchical RL should provide benefits over standard "shallow" RL architectures. In this work, we isolate and evaluate the claimed benefits of hierarchical RL on a suite of tasks encompassing locomotion, navigation, and manipulation. Surprisingly, we find that most of the observed benefits of hierarchy can be attributed to improved exploration, as opposed to easier policy learning or imposed hierarchical structures. Given this insight, we present exploration techniques inspired by hierarchy that achieve performance competitive with hierarchical RL while at the same time being much simpler to use and implement.

Paper Structure

This paper contains 12 sections, 4 equations, 7 figures.

Figures (7)

  • Figure 1: We consider four difficult tasks, where the agent (magenta) is a simulated quadrupedal robot. In AntMaze, the agent must navigate to the end of a U-shaped corridor (target given by green arrow); in AntPush, the agent must navigate to the target by first pushing a block obstacle to the right; in AntBlock and AntBlockMaze, the agent must push a small red block to the target location; see [nachum2018near] for more details. Task success rates are plotted for three HRL algorithms (HIRO [hiro], HIRO with goal relabelling (inspired by [levy2017hierarchical]), and Options [frans2017meta]) and for shallow (non-hierarchical) agents with and without the use of multi-step rewards ($n$-step returns) over 10M training steps, averaged over 5 seeds. In this work, we isolate and evaluate the key properties of hierarchy which yield the stark difference in empirical performance between HRL and non-HRL methods.
  • Figure 2: We present the results for different HRL methods while changing the temporal abstraction used for training ($c_{\mathrm{train}}$, top) or the temporal abstraction used for experience collection ($c_{\mathrm{expl}}$, bottom). Average success rates and standard errors are calculated for 5 randomly seeded runs, trained for 10M steps with early stopping. Recall that our HRL baselines use $c_{\mathrm{train}}=c_{\mathrm{expl}}=10$. When varying $c_{\mathrm{train}}$, we find that the choice of horizon matters only so far as $c_{\mathrm{train}}>1$. For $c_{\mathrm{expl}}$, while there exists correlation between performance and temporal abstraction, using no temporal abstraction ($c_{\mathrm{expl}}=1$) can still make non-negligible progress compared to the shallow policies in Figure 1.
  • Figure 3: We evaluate a non-hierarchical shadow agent trained on experience collected by a hierarchical agent, thus disentangling the potential benefits of HRL for exploration from the potential benefits of HRL for training. In all environments except AntMaze, the shadow agent can achieve performance competitive with HRL, given an appropriate multi-step reward horizon ($c_{\mathrm{rew}}=3$ performs best). Overall, this suggests that the effect of hierarchy on ease of training (as opposed to exploration) is modest, and can mostly be replicated by a non-hierarchical agent given good experience and the use of multi-step rewards.
  • Figure 4: We compare the performance of HRL to Explore & Exploit (E&E) and Switching Ensemble (SE) -- two non-hierarchical exploration methods that make use of HRL-inspired temporally extended modulation of behaviors (length of modulation given by $c_{\mathrm{switch}}$). We find that the non-hierarchical methods are able to match the performance of HRL on these tasks (with the only exceptions being Explore & Exploit on AntBlockMaze and Switching Ensemble on AntPush), suggesting that exploration is the key to success on these tasks. These results also make clear the importance of temporally extended exploration; using $c_{\mathrm{switch}}>1$ is almost always better than $c_{\mathrm{switch}}=1$.
  • Figure 5: A summary of our conclusions on the benefits of hierarchy.
  • ...and 2 more figures
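
The temporally extended modulation described in the Figure 4 caption can be illustrated with a small sketch. It is not the authors' implementation. It assumes that a Switching Ensemble style collector holds several independently trained policies and resamples which one acts every $c_{\mathrm{switch}}$ environment steps; uniform resampling, the function name `switching_schedule`, and all of its parameters are hypothetical choices made for illustration only.

```python
import random
from typing import List


def switching_schedule(
    num_policies: int,
    num_steps: int,
    c_switch: int,
    rng: random.Random,
) -> List[int]:
    """Return, for every environment step, the index of the acting policy.

    Assumption (not taken from the paper's code): the acting policy is
    resampled uniformly at random every `c_switch` steps and held fixed
    in between, which makes the exploratory behavior temporally extended.
    """
    if num_policies < 1 or c_switch < 1:
        raise ValueError("num_policies and c_switch must be >= 1")

    schedule: List[int] = []
    active = 0
    for t in range(num_steps):
        # Resample the acting policy only at the start of each block.
        if t % c_switch == 0:
            active = rng.randrange(num_policies)
        schedule.append(active)
    return schedule


if __name__ == "__main__":
    rng = random.Random(0)
    # c_switch = 10 holds one behavior for 10 consecutive steps;
    # c_switch = 1 may change the behavior at every step.
    print(switching_schedule(num_policies=5, num_steps=30, c_switch=10, rng=rng))
    print(switching_schedule(num_policies=5, num_steps=30, c_switch=1, rng=rng))
```

With `c_switch=1` the acting policy can change at every step, which corresponds to the setting without temporal extension that the caption reports as almost always worse than $c_{\mathrm{switch}}>1$.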