Goal-conditioned Imitation Learning

Yiming Ding, Carlos Florensa, Mariano Phielipp, Pieter Abbeel

TL;DR

The paper tackles reward design challenges in goal-directed robotics by marrying goal-conditioned imitation learning with hindsight relabeling. It introduces goalGAIL, an off-policy adversarial imitation framework conditioned on goals, and augments demonstrations with expert relabeling and the option to use state-only demonstrations. Across four continuous MuJoCo tasks, goalGAIL accelerates learning beyond HER and is robust to suboptimal experts, while expert relabeling and state-only demos further boost performance. The approach holds promise for data-efficient, reward-free policy learning in real-world robotics and can extend to vision-based settings in future work.

Abstract

Designing rewards for Reinforcement Learning (RL) is challenging because the reward needs to convey the desired task, be efficient to optimize, and be easy to compute. The latter is particularly problematic when applying RL to robotics, where detecting whether the desired configuration is reached might require considerable supervision and instrumentation. Furthermore, we are often interested in being able to reach a wide range of configurations, hence setting up a different reward every time might be impractical. Methods like Hindsight Experience Replay (HER) have recently shown promise to learn policies able to reach many goals, without the need for a reward. Unfortunately, without tricks like resetting to points along the trajectory, HER might require many samples to discover how to reach certain areas of the state-space. In this work we investigate different approaches to incorporate demonstrations to drastically speed up the convergence to a policy able to reach any goal, also surpassing the performance of an agent trained with other Imitation Learning algorithms. Furthermore, we show our method can also be used when the available expert trajectories do not contain the actions, which can leverage kinesthetic or third-person demonstrations. The code is available at https://sites.google.com/view/goalconditioned-il/.
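To make the hindsight idea concrete, here is a minimal Python sketch of how a single demonstration could be relabeled so that states reached later in the trajectory serve as goals for earlier steps. It illustrates the general relabeling idea only; the paper's exact sampling scheme is not given in this summary. The function name `relabel_trajectory`, the `achieved_goal` mapping, and the parameter `k` (number of relabeled goals per step) are all hypothetical names introduced for illustration.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]
Goal = Tuple[float, ...]


def relabel_trajectory(
    states: Sequence[State],
    actions: Sequence[Action],
    achieved_goal: Callable[[State], Goal],
    k: int = 4,
    rng: Optional[random.Random] = None,
) -> List[Tuple[State, Action, Goal]]:
    """Return (state, action, goal) samples with goals taken from future states.

    Assumption: `states` holds T + 1 entries and `actions` holds T entries,
    so actions[t] moves the agent from states[t] to states[t + 1].
    """
    if len(states) != len(actions) + 1:
        raise ValueError("expected len(states) == len(actions) + 1")
    rng = rng or random.Random(0)
    horizon = len(actions)
    samples: List[Tuple[State, Action, Goal]] = []
    for t in range(horizon):
        for _ in range(k):
            # Pick a strictly later time step; randint is inclusive on both ends.
            future = rng.randint(t + 1, horizon)
            samples.append((states[t], actions[t], achieved_goal(states[future])))
    return samples


# Toy demonstration: an agent moving right along a line; the goal is the position itself.
demo_states = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
demo_actions = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
relabeled = relabel_trajectory(demo_states, demo_actions, lambda s: s, k=2)
```

Each relabeled tuple can then be treated as an extra goal-conditioned training example, for instance as a supervised target for Behavioral Cloning. Under this reading, one demonstration that reaches a single goal yields training signal for every goal it passed through along the way.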

Paper Structure

This paper contains 16 sections, 5 equations, 7 figures, 1 algorithm.

Figures (7)

  • Figure 1: Policy performance on reaching different goals in the four rooms, when training with standard Behavioral Cloning (top row) or with our expert relabeling (bottom).
  • Figure 2: Four continuous goal-conditioned environments where we tested the effectiveness of the proposed algorithm goalGAIL and expert relabeling technique.
  • Figure 3: In all four environments, the proposed algorithm goalGAIL takes off and converges faster than HER by leveraging demonstrations. It is also able to outperform the demonstrator unlike standard GAIL, the performance of which is capped.
  • Figure 4: Our Expert Relabeling technique boosts final performance of standard BC. It also accelerates convergence of BC+HER and goalGAIL on all four environments.
  • Figure 5: Effect of sub-optimal demonstrations on goalGAIL and Behavioral Cloning. We produce sub-optimal demonstrations by making the expert $\epsilon$-greedy and adding Gaussian noise to the optimal actions.
  • ...and 2 more figures