Imitation from Diverse Behaviors: Wasserstein Quality Diversity Imitation Learning with Single-Step Archive Exploration

Xingrui Yu, Zhenglin Wan, David Mark Bossens, Yueming Lyu, Qing Guo, Ivor W. Tsang

TL;DR

This work tackles the challenge of learning a diverse set of high-quality policies from limited demonstrations by marrying quality-diversity imitation learning with stable, latent-space adversarial training. It introduces Wasserstein Quality Diversity Imitation Learning (WQDIL), which stabilizes reward learning via latent Wasserstein adversarial training in a Wasserstein Auto-Encoder and combats behavior overfitting with a measure-conditioned reward and a single-step archive exploration bonus. The proposed approach, including variants like mCWAE-WGAIL, demonstrates near-expert to beyond-expert quality-diversity performance on MuJoCo locomotion tasks, with ablations confirming the effectiveness of latent Wasserstein training and the exploration/measures components. This framework offers a practical path to robust, diverse imitation in settings with scarce demonstrations and complex locomotion behavior.

Abstract

Learning diverse and high-performance behaviors from a limited set of demonstrations is a grand challenge. Traditional imitation learning methods usually fail at this task because most of them are designed to learn one specific behavior, even when given multiple demonstrations. Novel techniques for quality diversity imitation learning, which bridge quality diversity optimization and imitation learning, are therefore needed. This work introduces Wasserstein Quality Diversity Imitation Learning (WQDIL), which 1) improves the stability of imitation learning in the quality diversity setting with latent adversarial training based on a Wasserstein Auto-Encoder (WAE), and 2) mitigates a behavior-overfitting issue using a measure-conditioned reward function with a single-step archive exploration bonus. Empirically, our method significantly outperforms state-of-the-art IL methods, achieving near-expert or beyond-expert QD performance on challenging continuous control tasks derived from MuJoCo environments.

Paper Structure

This paper contains 28 sections, 10 equations, 7 figures, 10 tables, 6 algorithms.

Figures (7)

  • Figure 1: Illustration of the two issues of Adversarial QDIL (i.e., training instability and behavior-overfitted reward) and their corresponding solutions (i.e., WQDIL and Single-Step Archive Exploration). $\delta(s)$ denotes the Markovian Measure Proxy of state $s$, a.k.a. the single-step measure.
  • Figure 2: Illustration of diverse behaviors learned by our Quality Diversity Imitation Learning framework on Humanoid and Walker2d, where each column represents one behavior. "Left" and "right" give the proportion of time the left or right leg is in contact with the ground.
  • Figure 3: Visualization of the demonstrations obtained from PPGA archives. The x and y axes show the proportion of time that legs 1 and 2, respectively, touch the ground. Green indicates the full expert behavior space, blue indicates the selected top-500 elites, and red indicates the selected demonstrators.
  • Figure 4: Visualization of the policy archives of Expert, GAIL, WAE-GAIL, WAE-WGAIL, WAE-WGAIL-Bonus and mCWAE-WGAIL-Bonus on Humanoid. The color indicates the cumulative reward of the best-performing policy in each archive cell.
  • Figure 5: Learning curve comparison of our mCWAE-WGAIL-Bonus against WAE-WGAIL and state-of-the-art IL methods. The curves and shaded areas show the mean and standard deviation for each algorithm. Columns correspond to the metrics (QD-Score, Coverage, Best Reward, and Average Reward), and rows to the benchmarks (HalfCheetah, Walker2d, and Humanoid).
  • ...and 2 more figures
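
The captions above refer to a policy archive indexed by behavior measures (for example, the proportion of time each leg touches the ground) and to four archive metrics. The sketch below shows one way such an archive and its metrics could be computed. It follows the conventions commonly used in the quality diversity literature, not definitions taken from this page: the grid resolution, the assumption that measures lie in [0, 1] and rewards are non-negative, and the names `cell_index`, `insert`, and `qd_metrics` are all hypothetical.

```python
from typing import Dict, Sequence, Tuple

Cell = Tuple[int, ...]


def cell_index(measures: Sequence[float], bins_per_dim: int) -> Cell:
    """Map measures (assumed to lie in [0, 1]) to a grid cell."""
    return tuple(min(int(m * bins_per_dim), bins_per_dim - 1) for m in measures)


def insert(archive: Dict[Cell, float], measures: Sequence[float],
           reward: float, bins_per_dim: int) -> bool:
    """Keep only the best cumulative reward seen in each cell.

    Returns True if the archive changed (new cell or improved elite).
    """
    cell = cell_index(measures, bins_per_dim)
    if cell not in archive or reward > archive[cell]:
        archive[cell] = reward
        return True
    return False


def qd_metrics(archive: Dict[Cell, float], bins_per_dim: int,
               n_dims: int = 2) -> Dict[str, float]:
    """Common archive metrics, assuming non-negative rewards."""
    total_cells = bins_per_dim ** n_dims
    rewards = list(archive.values())
    return {
        # Sum of the elite rewards over all occupied cells.
        "qd_score": sum(rewards),
        # Fraction of cells that hold at least one policy.
        "coverage": len(rewards) / total_cells,
        "best_reward": max(rewards) if rewards else 0.0,
        "average_reward": sum(rewards) / len(rewards) if rewards else 0.0,
    }
```

For instance, with a 10 x 10 grid over two leg-contact measures, inserting policies with measures (0.05, 0.95) and (0.5, 0.5) fills two of the 100 cells; a later policy that lands in an occupied cell replaces the stored one only if its reward is higher.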

Theorems & Definitions (1)

  • Definition 1: Quality-Diversity Imitation Learning (QDIL)