Diffusing States and Matching Scores: A New Framework for Imitation Learning

Runzhe Wu, Yiding Chen, Gokul Swamy, Kianté Brantley, Wen Sun

TL;DR

This paper presents SMILING, a non-adversarial imitation-learning framework that replaces discriminator training with diffusion score matching. It introduces the Diffusion Score Divergence (DS Divergence) to compare expert and learner state distributions. SMILING pre-trains an expert score function, then alternates between updating a learner score function and running RL to minimize a squared score difference along a forward diffusion process; the resulting zero-sum game is solved with Follow-the-Leader updates. The authors prove first- and second-order instance-dependent bounds that scale linearly with the horizon and depend on score-estimation and RL errors plus a misspecification term, showing that the approach avoids the compounding errors that affect offline imitation learning. Empirically, SMILING outperforms GAN-style baselines and discriminator-free methods on several continuous-control tasks, including humanoid locomotion and obstacle navigation, both with state-only demonstrations and with state-action data. These results suggest diffusion score matching as a stable, expressive alternative to adversarial training for IRL.

Abstract

Adversarial Imitation Learning is traditionally framed as a two-player zero-sum game between a learner and an adversarially chosen cost function, and can therefore be thought of as the sequential generalization of a Generative Adversarial Network (GAN). However, in recent years, diffusion models have emerged as a non-adversarial alternative to GANs that merely require training a score function via regression, yet produce generations of higher quality. In response, we investigate how to lift insights from diffusion modeling to the sequential setting. We propose diffusing states and performing score-matching along diffused states to measure the discrepancy between the expert's and learner's states. Thus, our approach only requires training score functions to predict noises via standard regression, making it significantly easier and more stable to train than adversarial methods. Theoretically, we prove first- and second-order instance-dependent bounds with linear scaling in the horizon, proving that our approach avoids the compounding errors that stymie offline approaches to imitation learning. Empirically, we show our approach outperforms both GAN-style imitation learning baselines and discriminator-free imitation learning baselines across various continuous control problems, including complex tasks like controlling humanoids to walk, sit, crawl, and navigate through obstacles.
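The idea of "diffusing states and matching scores" can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes one-dimensional Gaussian state distributions (so the diffused scores are available in closed form instead of being learned by regression), a variance-preserving noise schedule `alpha(t) = exp(-t)`, and a uniform average over a few hand-picked diffusion times. The function names are hypothetical.

```python
import math
import random


def alpha(t):
    """Assumed variance-preserving schedule; the paper's schedule may differ."""
    return math.exp(-t)


def diffuse(s0, t, rng):
    """Forward diffusion: s_t = sqrt(alpha_t) * s0 + sqrt(1 - alpha_t) * noise."""
    a = alpha(t)
    return math.sqrt(a) * s0 + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)


def gaussian_score(s_t, mean, std, t):
    """Closed-form score of a diffused N(mean, std^2), which stays Gaussian."""
    a = alpha(t)
    mean_t = math.sqrt(a) * mean
    var_t = a * std ** 2 + (1.0 - a)
    return -(s_t - mean_t) / var_t


def ds_divergence_estimate(p, q, times=(0.1, 0.5, 1.0), n_samples=5000, seed=0):
    """Monte Carlo estimate of the squared score difference along P's diffusion.

    p and q are (mean, std) pairs. States are sampled from P, diffused, and the
    two score functions are compared at the diffused states.
    """
    rng = random.Random(seed)
    p_mean, p_std = p
    q_mean, q_std = q
    total = 0.0
    for t in times:
        acc = 0.0
        for _ in range(n_samples):
            s_t = diffuse(rng.gauss(p_mean, p_std), t, rng)
            diff = gaussian_score(s_t, p_mean, p_std, t) - gaussian_score(s_t, q_mean, q_std, t)
            acc += diff * diff
        total += acc / n_samples
    return total / len(times)
```

With identical distributions the estimate is zero, and it grows as the learner's distribution `q` moves away from the expert's `p`. In the actual method both score functions are neural networks trained by noise-prediction regression, which this toy replaces with exact formulas.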

Paper Structure

This paper contains 26 sections, 7 theorems, 44 equations, 9 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Under the assumptions labeled asm:error and asm:diffusion, the algorithm achieves these instance-dependent bounds: where we define the total error $\epsilon\coloneqq $ and the misspecification $\epsilon_{\rm{mis}} \coloneqq \min_{\pi\in\Pi} \ell(\pi)$, and where we recall that $\ell(\cdot)$ is defined in Eq. eq:ideal_obj.

Figures (9)

  • Figure 1: DS divergence panel (fig:ds): The two curves represent the forward diffusion process of distributions $P$ and $Q$. DS Divergence measures the squared difference between the diffusion score functions, $\nabla \log P_t(s_t)$ and $\nabla \log Q_t(s_t)$, along the forward diffusion process of $P$. Method diagram panel (fig:diagram): SMILING first pre-trains a diffusion model from the expert's data. It then iteratively trains diffusion models on the learner's data and performs RL to optimize a cost function formed by the learner's score function and the pre-trained expert score function. The cost function is designed to faithfully approximate the DS divergence (Definition 1) between the learner and the expert.
  • Figure 2: Learning curves for learning from state-only data across five random seeds. The x-axis corresponds to the number of environment steps (also the number of policy updates). The y-axis is normalized such that the expert performance is one and the random policy is zero. Our method clearly outperforms all baselines in five tasks out of six.
  • Figure 3: Whole-body humanoid control via IL from state alone. The two panels illustrate the crawl and pole tasks, respectively. In both tasks, we show the time-lapse frames of the expert policy and the policies learned by our method (SMILING) and DAC after 3M training steps. In the crawl task, the goal is to crawl through a grey tunnel, where both the expert and ours succeed and the crawling movements are similar. However, DAC collapses and fails to complete the task. In the pole task, the goal is to travel through a dense forest of poles. Ours successfully navigates through the poles, though with less stability than the expert, while DAC collapses to the ground and cannot move. The bar graphs on the right show normalized policy performance, where SMILING significantly outperforms DAC and Behavioral Cloning (BC) in both tasks, approaching expert performance in crawl. Note that BC uses expert actions, while DAC and SMILING learn from states alone.
  • Figure 4: Learning curves for learning from state-action data across five random seeds. The x-axis corresponds to the number of environment steps (also the number of policy updates). The y-axis is normalized such that the expert performance is one and the random policy is zero. The results are consistent with those from the state-only setting.
  • Figure 5: Relationship between the number of expert demonstrations and the final performance. The x-axis corresponds to the number of expert states. The y-axis is the final performance that is computed as the average of the last 100 training epochs. Each point is the average of five random seeds.
  • ...and 4 more figures
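The captions above describe two evaluation conventions: returns are normalized so that the expert scores one and a random policy scores zero (Figures 2 and 4), and final performance is the average over the last 100 training epochs (Figure 5). A minimal sketch of both, assuming plain Python floats and hypothetical function names, could look like this:

```python
def normalized_score(ret, random_ret, expert_ret):
    """Rescale a raw return so random policy -> 0.0 and expert -> 1.0."""
    if expert_ret == random_ret:
        raise ValueError("expert and random returns must differ")
    return (ret - random_ret) / (expert_ret - random_ret)


def final_performance(curve, window=100):
    """Average of the last `window` entries of a learning curve.

    If the curve is shorter than the window, all entries are averaged.
    """
    if not curve:
        raise ValueError("curve must be non-empty")
    tail = curve[-window:]
    return sum(tail) / len(tail)
```

For instance, `final_performance([normalized_score(r, 10.0, 110.0) for r in returns])` would give one point of a plot like Figure 5 for a single seed, where `returns` is a hypothetical list of per-epoch raw returns.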

Theorems & Definitions (9)

  • Definition 1: Diffusion Score Divergence
  • Remark 1: Noise-prediction Form of the Cost Function
  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
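Definition 1 and Remark 1 are listed here by title only. Based on the description in the Figure 1 caption, the DS divergence can be sketched as below. The uniform weighting over diffusion time and the horizon $T$ are assumptions, since the exact definition is not reproduced on this page. The second line is the standard identity for a Gaussian perturbation kernel, which is why a score function can be trained by predicting noise.

```latex
% Sketch only: the time weighting and the horizon T are assumptions.
D_{\mathrm{DS}}(P \,\|\, Q)
  \;:=\; \mathbb{E}_{t \sim \mathcal{U}[0,T]}\,
  \mathbb{E}_{s_t \sim P_t}
  \bigl\| \nabla \log P_t(s_t) - \nabla \log Q_t(s_t) \bigr\|_2^2

% Standard identity for s_t = \alpha_t s_0 + \sigma_t \varepsilon,
% with \varepsilon \sim \mathcal{N}(0, I):
\nabla_{s_t} \log p(s_t \mid s_0)
  \;=\; -\frac{s_t - \alpha_t s_0}{\sigma_t^2}
  \;=\; -\frac{\varepsilon}{\sigma_t}
```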