Diffusion Imitation from Observation

Bo-Ruei Huang, Chun-Kai Yang, Chun-Mao Lai, Dai-Jie Wu, Shao-Hua Sun

TL;DR

This work employs a diffusion model to capture expert and agent transitions by generating the next state, given the current state, and reformulates the learning objective to train the diffusion model as a binary classifier and use it to provide "realness" rewards for policy learning.

Abstract

Learning from observation (LfO) aims to imitate experts by learning from state-only demonstrations without requiring action labels. Existing adversarial imitation learning approaches learn a generator agent policy to produce state transitions that are indistinguishable to a discriminator that learns to classify agent and expert state transitions. Despite their simple formulation, these methods are often sensitive to hyperparameters and brittle to train. Motivated by the recent success of diffusion models in generative modeling, we propose to integrate a diffusion model into the adversarial imitation learning from observation framework. Specifically, we employ a diffusion model to capture expert and agent transitions by generating the next state, given the current state. Then, we reformulate the learning objective to train the diffusion model as a binary classifier and use it to provide "realness" rewards for policy learning. Our proposed framework, Diffusion Imitation from Observation (DIFO), demonstrates superior performance in various continuous control domains, including navigation, locomotion, manipulation, and games. Project page: https://nturobotlearninglab.github.io/DIFO

Paper Structure (31 sections, 8 equations, 11 figures, 6 tables, 1 algorithm)

Figures (11)

  • Figure 1: Diffusion Imitation from Observation (DIFO). We propose Diffusion Imitation from Observation (DIFO), a novel adversarial imitation learning from observation framework employing a conditional diffusion model. (a) Learning diffusion discriminator. In the discriminator step, the diffusion model learns to model a state transition $(\mathbf{s}, \mathbf{s}')$ by conditioning on the current state $\mathbf{s}$ and generating the next state $\mathbf{s}'$. With the additional condition on binary expert and agent labels ($c_E/c_A$), we construct the diffusion discriminator to distinguish expert and agent transitions by leveraging the single-step denoising loss as a likelihood approximation. (b) Learning policy with diffusion reward. In the policy step, we optimize the policy with reinforcement learning according to rewards calculated based on the diffusion discriminator's output $\log(1 - \mathcal{D}_{\phi}(\mathbf{s},\mathbf{s}'))$.
  • Figure 2: Environments & tasks. (a) PointMaze: A point agent (green) is trained to navigate to the goal (red). (b) AntMaze: A high-dimensional locomotion navigation task for an 8-DoF quadruped ant to navigate to the goal (red). (c) FetchPush: A manipulation task to move a block (yellow) to the target (red). (d) AdroitDoor: A high-dimensional manipulation task to undo the latch and swing the door open. (e) Walker: A locomotion task for a 6-DoF hopper to maintain the highest speed while keeping balance. (f) OpenMicrowave: A manipulation task to control the robot arm to open the microwave with joint space control. (g) CarRacing: An image-based task to control the car to complete the track in the shortest time. (h) CloseDrawer: An image-based manipulation task to control the robot arm to close the drawer.
  • Figure 3: Learning performance and efficiency. We evaluate all the methods with five random seeds and report their success rates in PointMaze, AntMaze, FetchPush, AdroitDoor, OpenMicrowave, and CloseDrawer, and their returns in Walker and CarRacing. The standard deviation is shown as the shaded area. Our proposed method, DIFO, demonstrates more stable and faster learning performance compared to the baselines.
  • Figure 4: Data efficiency. We vary the amount of available expert demonstrations in AntMaze. Our proposed method, DIFO, consistently outperforms other methods as the number of expert demonstrations decreases, highlighting the data efficiency of DIFO.
  • Figure 5: Generated trajectories under PointMaze. The green point marks the initial state. The red point marks the goal. The blue trace represents the generated trajectory and the orange trace represents the corresponding expert trajectory.
  • ...and 6 more figures
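
The Figure 1 caption describes a discriminator built from single-step denoising losses under the expert and agent labels, and a reward derived from its output. The following is a minimal, illustrative sketch of that idea in plain Python, not the authors' implementation. Several pieces are assumptions made for the example: the toy denoiser `predict_noise` and its `OFFSETS` table stand in for a trained conditional diffusion model, `ALPHA_BAR` is a single fixed noise-schedule value, the discriminator is taken to be a sigmoid of the difference between the two label-conditioned losses, and the reward uses the sign convention $-\log(1 - \mathcal{D})$, whereas the caption only states that the reward is based on $\log(1 - \mathcal{D}_{\phi}(\mathbf{s},\mathbf{s}'))$.

```python
import math
import random

ALPHA_BAR = 0.5  # assumed cumulative noise-schedule value at the sampled step

# Hypothetical per-label transition guesses standing in for a trained network.
OFFSETS = {"expert": 1.0, "agent": -1.0}


def predict_noise(noisy_next, state, label):
    """Toy conditional denoiser: recovers the noise assuming s' = s + OFFSETS[label]."""
    scale = math.sqrt(ALPHA_BAR)
    sigma = math.sqrt(1.0 - ALPHA_BAR)
    return [(x - scale * (s + OFFSETS[label])) / sigma
            for x, s in zip(noisy_next, state)]


def denoising_loss(state, next_state, label, rng):
    """Single-step denoising loss: noise s' once, then score the noise prediction."""
    noise = [rng.gauss(0.0, 1.0) for _ in next_state]
    scale = math.sqrt(ALPHA_BAR)
    sigma = math.sqrt(1.0 - ALPHA_BAR)
    noisy_next = [scale * x + sigma * e for x, e in zip(next_state, noise)]
    predicted = predict_noise(noisy_next, state, label)
    return sum((e - p) ** 2 for e, p in zip(noise, predicted)) / len(noise)


def discriminator(state, next_state, rng):
    """Assumed form: sigmoid of (agent-label loss minus expert-label loss)."""
    loss_expert = denoising_loss(state, next_state, "expert", rng)
    loss_agent = denoising_loss(state, next_state, "agent", rng)
    return 1.0 / (1.0 + math.exp(-(loss_agent - loss_expert)))


def reward(state, next_state, rng):
    """Assumed sign convention: -log(1 - D), larger for expert-like transitions."""
    d = discriminator(state, next_state, rng)
    return -math.log(max(1.0 - d, 1e-8))


if __name__ == "__main__":
    rng = random.Random(0)
    state = [0.0, 0.0]
    print("expert-like:", discriminator(state, [1.0, 1.0], rng))
    print("agent-like: ", discriminator(state, [-1.0, -1.0], rng))
```

In this toy setup, a transition that matches the expert-label guess gets a low expert-conditioned loss and a high agent-conditioned loss, so the discriminator output moves toward 1 and the reward grows. A real system would replace `predict_noise` with a learned network and average the loss over sampled diffusion steps.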