Meta Learning Shared Hierarchies

Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, John Schulman

TL;DR

The paper tackles sample efficiency in reinforcement learning by meta-learning hierarchically structured policies over a distribution of tasks. It introduces MLSH (Meta Learning Shared Hierarchies), a framework that learns a set of shared sub-policies (motor primitives) together with a task-specific master policy that switches between them, so that new tasks can be learned quickly; training is end-to-end and works with off-the-shelf RL methods. The authors formalize the problem as an optimization objective, present a two-phase update (warmup, then joint update), and show that the learned primitives enable fast adaptation and transfer to sparse-reward tasks, with experiments ranging from 2D environments to simulated four-legged and 3D humanoid robots. The approach builds reusable skill hierarchies without hand-engineered primitives.

Abstract

We develop a metalearning approach for learning hierarchically structured policies, improving sample efficiency on unseen tasks through the use of shared primitives, i.e., policies that are executed for large numbers of timesteps. Specifically, a set of primitives is shared within a distribution of tasks, and task-specific policies switch between them. We provide a concrete metric for measuring the strength of such hierarchies, leading to an optimization problem for quickly reaching high reward on unseen tasks. We then present an algorithm to solve this problem end-to-end through the use of any off-the-shelf reinforcement learning method, by repeatedly sampling new tasks and resetting task-specific policies. We successfully discover meaningful motor primitives for the directional movement of four-legged robots, solely by interacting with distributions of mazes. We also demonstrate the transferability of primitives to solve long-timescale sparse-reward obstacle courses, and we enable 3D humanoid robots to robustly walk and crawl with the same policy.
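
The outer training loop described above (sample a task, reset the task-specific policy, then run a warmup phase and a joint-update phase) might be sketched as follows. This is only an illustrative skeleton, not the authors' implementation: the function name `train_mlsh`, all parameter names, and the callables passed in are hypothetical, and the sketch assumes that warmup updates the master policy alone while the joint phase updates both levels.

```python
def train_mlsh(sample_task, init_master, update_master, update_subpolicies,
               subpolicies, iterations, warmup_steps, joint_steps):
    """Hypothetical outer loop: the master is reset per task, sub-policies persist."""
    log = []
    for _ in range(iterations):
        task = sample_task()
        # The task-specific (master) policy starts fresh for every sampled task.
        master = init_master()
        # Warmup phase (assumed): only the master adapts to the current primitives.
        for _ in range(warmup_steps):
            master = update_master(master, subpolicies, task)
            log.append("warmup")
        # Joint phase (assumed): master and shared sub-policies are both updated.
        for _ in range(joint_steps):
            master = update_master(master, subpolicies, task)
            subpolicies = update_subpolicies(master, subpolicies, task)
            log.append("joint")
    return subpolicies, log


# Toy usage: integers stand in for parameters and each "update" adds 1.
subs, log = train_mlsh(lambda: 0, lambda: 0,
                       lambda m, s, t: m + 1, lambda m, s, t: s + 1,
                       0, iterations=2, warmup_steps=3, joint_steps=2)
```

Because the two update callables are injected, any off-the-shelf RL update could be plugged in for either level without changing the loop.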

Paper Structure

This paper contains 14 sections, 1 equation, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: Structure of a hierarchical sub-policy agent. $\theta$ represents the master policy, which selects a sub-policy to be active. In the diagram, $\phi_3$ is the active sub-policy, and actions are taken according to its output.
  • Figure 2: Unrolled structure for a master policy action lasting $N=3$ timesteps. Left: When training the master policy, the update only depends on the master policy's action and total reward (blue region), treating the individual actions and rewards as part of the environment transition (red region). Right: When training sub-policies, the update considers the master policy's action as part of the observation (blue region), ignoring actions in other timesteps (red region).
  • Figure 3: Sampled tasks from 2D moving bandits. Small green dot represents the agent, while blue and yellow dots represent potential goal points. Right: Blue/red arrows correspond to movements when taking sub-policies 1 and 2 respectively.
  • Figure 4: Learning curves for 2D Moving Bandits and Four Rooms
  • Figure 5: Top: Ant Twowalk. Ant must maneuver towards red goal point, either towards the top or towards the right. Bottom Left: Walking. Humanoid must move horizontally while maintaining an upright stance. Bottom Right: Crawling. Humanoid must move horizontally while a height-limiting obstacle is present.
  • ...and 3 more figures
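
The captions of Figures 1 and 2 suggest how one rollout yields two views of the same experience: a coarse one for the master policy (one transition per $N$ timesteps, with the rewards summed) and a fine one for the sub-policies (one transition per timestep, tagged with the active sub-policy). A minimal sketch of that bookkeeping follows. The function `rollout`, its arguments, and the toy environment are hypothetical, and no learning update is performed.

```python
def rollout(env_step, obs, master, subpolicies, horizon, n):
    """Collect master-level and sub-policy-level transitions from one episode."""
    master_data, sub_data = [], []
    t = 0
    while t < horizon:
        start_obs = obs
        k = master(start_obs)          # master selects the active sub-policy
        total = 0.0
        for _ in range(min(n, horizon - t)):
            action = subpolicies[k](obs)
            next_obs, reward = env_step(obs, action)
            # Sub-policy view: the master's choice k is stored with the observation.
            sub_data.append((obs, k, action, reward))
            total += reward
            obs = next_obs
            t += 1
        # Master view: the inner steps collapse into one transition with summed reward.
        master_data.append((start_obs, k, total))
    return master_data, sub_data


# Toy 1D environment (an assumption for illustration): reward equals the step taken.
env_step = lambda o, a: (o + a, float(a))
subpolicies = [lambda o: 1, lambda o: -1]
master = lambda o: 0 if o < 3 else 1
master_data, sub_data = rollout(env_step, 0, master, subpolicies, horizon=6, n=3)
```

Either list could then be handed to a standard RL update for its level, which is what lets each level treat the other as part of the environment.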