SD2AIL: Adversarial Imitation Learning from Synthetic Demonstrations via Diffusion Models

Pengcheng Li, Qiang Fang, Tong Zhao, Yixing Lan, Xin Xu

TL;DR

SD2AIL tackles the scarcity of expert demonstrations in Adversarial Imitation Learning by injecting diffusion-model–generated pseudo-experts into the discriminator training. A dynamic confidence-based selection produces high-quality synthetic demonstrations, which are replayed via Prioritized Expert Demonstration Replay (PEDR) to accelerate learning. The approach, combined with SAC for policy optimization, yields superior or competitive results across four MuJoCo continuous-control tasks, with faster convergence and strong alignment between surrogate and true rewards. The work demonstrates that diffusion-enabled synthetic data can substantially boost AIL performance while maintaining data efficiency, albeit with additional training time due to diffusion computations.

Abstract

Adversarial Imitation Learning (AIL) is a dominant framework in imitation learning that infers rewards from expert demonstrations to guide policy optimization. Although providing more expert demonstrations typically leads to improved performance and greater stability, collecting such demonstrations can be challenging in certain scenarios. Inspired by the success of diffusion models in data generation, we propose SD2AIL, which utilizes synthetic demonstrations via diffusion models. We first employ a diffusion model in the discriminator to generate synthetic demonstrations as pseudo-expert data that augment the expert demonstrations. To selectively replay the most valuable demonstrations from the large pool of (pseudo-) expert demonstrations, we further introduce a prioritized expert demonstration replay strategy (PEDR). The experimental results on simulation tasks demonstrate the effectiveness and robustness of our method. In particular, in the Hopper task, our method achieves an average return of 3441, surpassing the state-of-the-art method by 89. Our code will be available at https://github.com/positron-lpc/SD2AIL.
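
The abstract does not spell out how pseudo-expert demonstrations are filtered or how replay priorities are computed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a discriminator confidence score in [0, 1] per generated sample, a fixed threshold, and proportional prioritization with an exponent `alpha`. The function names, the threshold, and `alpha` are all hypothetical.

```python
import random


def select_pseudo_experts(candidates, confidences, threshold):
    """Keep generated samples whose confidence is at least `threshold`.

    Assumption: `confidences` are discriminator scores in [0, 1]; the source
    does not state the actual selection rule.
    """
    return [(c, p) for c, p in zip(candidates, confidences) if p >= threshold]


def prioritized_sample(pool, priorities, batch_size, alpha=0.6, eps=1e-6, rng=random):
    """Sample `batch_size` items with probability proportional to priority**alpha.

    Assumption: proportional prioritization in the style of prioritized
    experience replay; `eps` keeps every item reachable when it is positive.
    """
    weights = [(p + eps) ** alpha for p in priorities]
    return rng.choices(pool, weights=weights, k=batch_size)


# Hypothetical usage: merge real experts (priority 1.0) with selected pseudo-experts.
experts = ["expert_0", "expert_1"]
generated = ["pseudo_0", "pseudo_1", "pseudo_2"]
scores = [0.9, 0.4, 0.75]

kept = select_pseudo_experts(generated, scores, threshold=0.7)
pool = experts + [c for c, _ in kept]
priorities = [1.0] * len(experts) + [p for _, p in kept]
batch = prioritized_sample(pool, priorities, batch_size=4, rng=random.Random(0))
```

Sampling with replacement keeps the sketch short; a practical replay buffer would also update priorities as the discriminator changes.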

Paper Structure

This paper contains 10 sections, 8 equations, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: SD2AIL. The overview of our proposed pipeline. (a) The diffusion discriminator $D_\phi$ samples expert demonstrations and pseudo-expert demonstrations using the prioritized expert demonstration replay method. The discriminator learns to differentiate between this data and the agent's data. (b) During the optimization of the policy $\pi_\theta$, the diffusion discriminator calculates the reward $R$ from the output $D_\phi$ and generates pseudo-expert demonstrations during the denoising process. The policy network learns to maximize the reward from the discriminator. The generated pseudo-expert demonstrations are used for the next round of discriminator training.
  • Figure 2: Four classic continuous control tasks (Ant, Hopper, Walker, and HalfCheetah) in MuJoCo.
  • Figure 3: Comparison of average returns on the Ant, Hopper, Walker, and HalfCheetah tasks, each run with five random seeds. Our method achieves faster convergence than the baselines in Hopper, Walker, and HalfCheetah.
  • Figure 4: Visualization of demonstrations. The pseudo-expert demonstrations exhibit a distribution similar to that of expert demonstrations. The additional data provides clearer guidance for the discriminator's training.
  • Figure 5: FD to the expert demonstration distribution for a random policy and for pseudo-expert demonstrations. A smaller FD indicates greater similarity between two distributions. The FD of the pseudo-expert demonstrations is significantly smaller than that of the random policy, indicating the validity of the generated samples.
  • ...and 3 more figures
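
The Figure 5 caption compares distributions by FD without defining it here. Assuming FD denotes the Fréchet distance between Gaussians fitted to two sample sets (as in FID-style metrics), a minimal sketch of the computation could look like the following. The function name and the array layout (one sample per row) are illustrative choices, not taken from the paper.

```python
import numpy as np


def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to sample sets x and y.

    Assumption: x and y are 2-D arrays with one sample per row.
    FD = ||mu_x - mu_y||^2 + Tr(C_x + C_y - 2 (C_x C_y)^{1/2})
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.atleast_2d(np.cov(x, rowvar=False))
    cov_y = np.atleast_2d(np.cov(y, rowvar=False))
    # Tr((C_x C_y)^{1/2}) is the sum of square roots of the eigenvalues of
    # C_x C_y, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = float(np.sum((mu_x - mu_y) ** 2))
    return mean_term + float(np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)
```

Under this definition, two sample sets drawn from the same distribution give a value near zero, and shifting one set by a constant vector adds the squared norm of that shift.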