Imitation Learning from Purified Demonstrations
Yunke Wang, Minjing Dong, Yukun Zhao, Bo Du, Chang Xu
TL;DR
The paper tackles imitation learning with imperfect demonstrations by introducing DP-IL, a diffusion-based purification framework. DP-IL first applies a forward diffusion process to suboptimal data, adding noise that washes out perturbation patterns, and then uses a learned reverse diffusion process to recover purified demonstrations. The diffusion model is trained on a small set of optimal demonstrations and applied to purify the larger set of suboptimal ones, so the agent learns from a closer approximation to the expert distribution via occupancy-measure matching. The authors provide theoretical bounds on the distance between purified and optimal demonstrations and show that DP-IL improves performance in both offline (BC) and online (GAIL) settings on MuJoCo and RoboSuite, across various noise types and demonstration qualities. The method is modular and can be integrated into existing IL frameworks, offering a practical path to robust policy learning when optimal data is scarce.
Abstract
Imitation learning has emerged as a promising approach for addressing sequential decision-making problems, under the assumption that expert demonstrations are optimal. In real-world scenarios, however, demonstrations are often imperfect, which undermines the effectiveness of imitation learning. While existing research has focused on optimizing with imperfect demonstrations, the training typically requires a certain proportion of optimal demonstrations to guarantee performance. To tackle these problems, we propose to first purify the potential noise in imperfect demonstrations and then conduct imitation learning from the purified demonstrations. Motivated by the success of diffusion models, we introduce a two-step purification via a diffusion process. In the first step, we apply a forward diffusion process to smooth out potential noise in imperfect demonstrations by introducing additional noise. In the second step, a reverse generative process is used to recover the optimal demonstrations from the diffused ones. We provide theoretical evidence supporting our approach, demonstrating that the distance between the purified and optimal demonstrations can be bounded. Empirical results on MuJoCo and RoboSuite demonstrate the effectiveness of our method from different aspects.
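The two-step purification described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a standard DDPM-style forward and reverse process with a linear noise schedule, treats each demonstration as a flat state-action vector, and replaces the learned diffusion network with a closed-form noise predictor fitted to a Gaussian approximation of the optimal demonstrations. All names (`make_schedule`, `fit_gaussian_denoiser`, `purify`, `t_star`) and all numeric settings are hypothetical choices made for illustration.

```python
import numpy as np


def make_schedule(num_steps=100, beta_min=1e-4, beta_max=0.05):
    """Linear DDPM-style noise schedule (an assumed choice, not from the paper)."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def fit_gaussian_denoiser(optimal_demos):
    """Stand-in for a trained diffusion model.

    Fits a diagonal Gaussian to the optimal demonstrations and returns the
    exact posterior-mean noise predictor E[eps | x_t] for that Gaussian.
    """
    mu = optimal_demos.mean(axis=0)
    var = optimal_demos.var(axis=0) + 1e-8

    def predict_noise(x_t, alpha_bar_t):
        # Marginal variance of x_t when x_0 ~ N(mu, var).
        marginal_var = alpha_bar_t * var + (1.0 - alpha_bar_t)
        centered = x_t - np.sqrt(alpha_bar_t) * mu
        return np.sqrt(1.0 - alpha_bar_t) * centered / marginal_var

    return predict_noise


def purify(demos, predict_noise, schedule, t_star, rng):
    """Diffuse demos forward to step t_star, then denoise back to step 0."""
    betas, alphas, alpha_bars = schedule

    # Step 1: forward diffusion in closed form,
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    ab = alpha_bars[t_star - 1]
    x = np.sqrt(ab) * demos + np.sqrt(1.0 - ab) * rng.standard_normal(demos.shape)

    # Step 2: reverse process, one ancestral sampling step per timestep.
    for t in range(t_star, 0, -1):
        i = t - 1
        eps_hat = predict_noise(x, alpha_bars[i])
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps_hat) / np.sqrt(alphas[i])
        # No noise is added on the final step.
        z = rng.standard_normal(x.shape) if t > 1 else 0.0
        x = mean + np.sqrt(betas[i]) * z
    return x


# Toy data: a small optimal set and a larger perturbed (suboptimal) set.
rng = np.random.default_rng(0)
expert_mean = np.array([1.0, -1.0, 0.5, 2.0])
optimal = expert_mean + 0.1 * rng.standard_normal((200, 4))
clean = expert_mean + 0.1 * rng.standard_normal((1000, 4))
suboptimal = clean + 1.0 * rng.standard_normal(clean.shape)

schedule = make_schedule()
denoiser = fit_gaussian_denoiser(optimal)
purified = purify(suboptimal, denoiser, schedule, t_star=50, rng=rng)

err_before = np.linalg.norm(suboptimal - clean, axis=1).mean()
err_after = np.linalg.norm(purified - clean, axis=1).mean()
print(f"mean distance to clean demos: before={err_before:.3f}, after={err_after:.3f}")
```

In this toy setting the purified vectors land much closer to their clean counterparts than the perturbed inputs do. The diffusion depth `t_star` controls the trade-off: a larger value washes out more of the perturbation but also discards more of the original demonstration, while a smaller value preserves more of both. The purified set would then be passed to a downstream imitation learner such as BC or GAIL.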
