On the benefits of pixel-based hierarchical policies for task generalization

Tudor Cristea-Platon; Bogdan Mazoure; Josh Susskind; Walter Talbott

TL;DR

This work investigates whether pixel-based hierarchical reinforcement learning (HRL) with task conditioning can improve generalization across tasks. Building on the Director architecture, it analyzes how a high-level manager and a low-level worker can compose reusable skills to shorten the effective horizon from $H$ to $H/k$, where $k$ is the number of worker steps per manager decision, enable zero-shot generalization through compositionality, and speed up adaptation by reusing low-level policies. The results show that HRL improves performance on the training tasks in a multi-task setting, enhances reward and state-space generalization to similar tasks, and reduces the data required to solve novel tasks during fine-tuning. Overall, the findings advocate for incorporating hierarchy in RL architectures to promote generalization in vision-based robotic control scenarios.

Abstract

Reinforcement learning practitioners often avoid hierarchical policies, especially in image-based observation spaces. Typically, the single-task performance improvement over flat-policy counterparts does not justify the additional complexity of implementing a hierarchy. However, by introducing multiple decision-making levels, hierarchical policies can compose lower-level policies to generalize more effectively between tasks, which highlights the need for multi-task evaluations. We analyze the benefits of hierarchy through simulated multi-task robotic control experiments from pixels. Our results show that hierarchical policies trained with task conditioning can (1) increase performance on training tasks, (2) improve reward and state-space generalization in similar tasks, and (3) decrease the complexity of the fine-tuning required to solve novel tasks. Thus, we believe that hierarchical policies should be considered when building reinforcement learning architectures capable of generalizing between tasks.

Paper Structure (16 sections, 9 figures, 2 tables)


Introduction
Related Work
Motivation
Shortening the effective task horizon
Zero-shot generalization through compositionality
Fast few-shot adaptation
Experiments
In-domain performance
Locomotion
Navigation
Zero-shot generalization performance
Locomotion
Navigation
Ablation of goal selection frequency
Few-shot generalization performance
...and 1 more section

Figures (9)

Figure 1: Architecture overview, adapted from Director [Director] and augmented with task conditioning. Image observations and task information, in the form of extrinsic rewards [MELD], are passed to a world model (WM) such as PlaNet [Planet], which encodes them into latent states. These latent states are used to train a categorical VAE and, at the same time, are passed to the hierarchical policies [Director], i.e. the higher-level policy (called the manager) and the lower-level one (called the worker). The manager selects abstract actions in the latent space of the VAE, which are decoded into latent-space goal states before being passed to the worker. Finally, the worker outputs primitive actions in an attempt to match the goal states set by the manager.
Figure 2: Illustration of the key components of the hierarchical RL paradigm: short effective horizon, compositionality, and fast adaptation. The manager prescribes an abstract action, $a_1$, at time step 1, after which the worker takes primitive actions (joint actuations) for a predetermined period of time (the goal horizon), for example six steps. Only after this sequence of steps has been executed does the manager take a second action, $a_2$, at time step 7, and the process repeats. Thus, in the figure, the flat policy has an effective horizon of 12, while the hierarchy has an effective horizon of only 2.
Figure 3: The training curves for the locomotion tasks.
Figure 4: The training task represents a quadruped reaching a re-spawning green spherical target in a $5\times 5$ box with colored walls.
Figure 5: Sequence of image stills from locomotion episodes. While both the hierarchical and the flat policies are able to match the target walking speed under training conditions, e.g. $v_\text{target}=2.0$, only the hierarchical agent can generalize to an unseen speed, e.g. $v_\text{target}=5.0$.
...and 4 more figures
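
The timing described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the goal horizon is assumed to be fixed, and rounding up when the episode length is not a multiple of the goal horizon is an assumption made here.

```python
import math


def manager_decision_steps(horizon: int, goal_horizon: int) -> list[int]:
    """Return the 1-indexed time steps at which the manager picks a new abstract action.

    Assumes a fixed goal horizon: the worker acts for `goal_horizon` primitive
    steps between consecutive manager decisions.
    """
    return list(range(1, horizon + 1, goal_horizon))


def effective_horizon(horizon: int, goal_horizon: int) -> int:
    """Number of manager decisions in an episode of `horizon` primitive steps.

    Rounding up for episodes that are not a multiple of the goal horizon is an
    assumption of this sketch.
    """
    return math.ceil(horizon / goal_horizon)


# The example from the Figure 2 caption: 12 primitive steps, goal horizon of 6.
print(manager_decision_steps(12, 6))  # [1, 7]
print(effective_horizon(12, 6))       # 2
print(effective_horizon(12, 1))       # 12, i.e. the flat policy decides at every step
```

With a goal horizon of 1 the manager would decide at every step, which recovers the flat-policy horizon of 12 used as the comparison in the caption.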
