Imitation from Diverse Behaviors: Wasserstein Quality Diversity Imitation Learning with Single-Step Archive Exploration
Xingrui Yu, Zhenglin Wan, David Mark Bossens, Yueming Lyu, Qing Guo, Ivor W. Tsang
TL;DR
This work tackles the challenge of learning a diverse set of high-quality policies from limited demonstrations by marrying quality-diversity imitation learning with stable, latent-space adversarial training. It introduces Wasserstein Quality Diversity Imitation Learning (WQDIL), which stabilizes reward learning via latent Wasserstein adversarial training in a Wasserstein Auto-Encoder and combats behavior overfitting with a measure-conditioned reward and a single-step archive exploration bonus. The proposed approach, including variants like mCWAE-WGAIL, demonstrates near-expert to beyond-expert quality-diversity performance on MuJoCo locomotion tasks, with ablations confirming the effectiveness of latent Wasserstein training and the exploration/measures components. This framework offers a practical path to robust, diverse imitation in settings with scarce demonstrations and complex locomotion behavior.
Abstract
Learning diverse and high-performance behaviors from a limited set of demonstrations is a grand challenge. Traditional imitation learning methods usually fail at this task because most of them are designed to learn one specific behavior, even when given multiple demonstrations. Novel techniques for quality diversity imitation learning, which bridges quality diversity optimization and imitation learning, are therefore needed to address this challenge. This work introduces Wasserstein Quality Diversity Imitation Learning (WQDIL), which 1) improves the stability of imitation learning in the quality diversity setting with latent adversarial training based on a Wasserstein Auto-Encoder (WAE), and 2) mitigates a behavior-overfitting issue using a measure-conditioned reward function with a single-step archive exploration bonus. Empirically, our method significantly outperforms state-of-the-art IL methods, achieving near-expert or beyond-expert QD performance on challenging continuous control tasks derived from MuJoCo environments.
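To make the idea of a single-step archive exploration bonus more concrete, the following is a minimal illustrative sketch, not the method as specified in the paper. It assumes a bounded measure space discretized into a uniform grid, and it assumes the bonus is paid whenever the measure observed at a single step falls into a grid cell that has not been visited before. The names `VisitArchive`, `shaped_reward`, and the weight `beta` are hypothetical, and `imitation_reward` stands in for the output of the learned measure-conditioned reward.

```python
import numpy as np


class VisitArchive:
    """Hypothetical grid archive over a bounded measure space (an assumption)."""

    def __init__(self, low, high, cells_per_dim):
        self.low = np.asarray(low, dtype=float)
        self.high = np.asarray(high, dtype=float)
        self.cells = int(cells_per_dim)
        # One boolean per grid cell: has any step landed here yet?
        self.visited = np.zeros((self.cells,) * len(self.low), dtype=bool)

    def cell_index(self, measure):
        # Map a measure vector to integer grid coordinates, clipped to the grid.
        frac = (np.asarray(measure, dtype=float) - self.low) / (self.high - self.low)
        idx = np.clip((frac * self.cells).astype(int), 0, self.cells - 1)
        return tuple(idx)

    def bonus(self, measure):
        # Return 1.0 the first time a cell is reached, 0.0 afterwards,
        # and mark the cell as visited.
        idx = self.cell_index(measure)
        is_new = not self.visited[idx]
        self.visited[idx] = True
        return 1.0 if is_new else 0.0


def shaped_reward(imitation_reward, measure, archive, beta=0.5):
    # Assumed additive combination of the imitation reward and the bonus.
    return imitation_reward + beta * archive.bonus(measure)


archive = VisitArchive(low=[0.0, 0.0], high=[1.0, 1.0], cells_per_dim=4)
first = shaped_reward(1.0, [0.1, 0.9], archive)   # new cell: bonus applied
repeat = shaped_reward(1.0, [0.1, 0.9], archive)  # same cell: no bonus
other = shaped_reward(1.0, [0.6, 0.9], archive)   # different cell: bonus applied
```

In this sketch the bonus depends only on the measure of the current step rather than on a whole-episode summary, which is one plausible reading of "single-step". The additive form and the binary novelty signal are design choices made for illustration only.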
