Diffusion Meets DAgger: Supercharging Eye-in-hand Imitation Learning

Xiaoyu Zhang, Matthew Chang, Pranav Kumar, Saurabh Gupta

TL;DR

This work addresses the costly data-collection bottleneck in imitation learning for eye-in-hand robotics by introducing Diffusion Meets DAgger (DMD), which synthesizes off-trajectory views with a conditional diffusion model and assigns corrective labels to augment the expert dataset. The diffusion model is conditioned on a reference image and a pose transformation ${}_{a}T_{b}$, so it can generate the view seen after a small perturbation $\Delta p$ to the trajectory; the augmented samples are labeled using future frames $I_{t+k}$, which mitigates the overshooting problem and improves policy robustness. Across pushing, stacking, pouring, and shirt-hanging tasks, DMD consistently outperforms vanilla behavior cloning and NeRF-based SPARTN augmentation while needing far fewer demonstrations (for example, 80% success in pushing from as few as 8 expert demonstrations, versus 20% for behavior cloning), and it generalizes well to unseen objects and environments. The results show the practical value of data-creation strategies for sample-efficient imitation learning in manipulation tasks. Limitations concern recoverability, and future directions include multi-modal data and more discontinuous dynamics.

Abstract

A common failure mode for policies trained with imitation is compounding execution errors at test time. When the learned policy encounters states that are not present in the expert demonstrations, the policy fails, leading to degenerate behavior. The Dataset Aggregation (DAgger) approach to this problem simply collects more data to cover these failure states. In practice, however, this is often prohibitively expensive. In this work, we propose Diffusion Meets DAgger (DMD), a method that reaps the benefits of DAgger without its cost for eye-in-hand imitation learning problems. Instead of collecting new samples to cover out-of-distribution states, DMD uses recent advances in diffusion models to synthesize these samples. This leads to robust performance from few demonstrations. We compare DMD against a behavior cloning (BC) baseline across four tasks: pushing, stacking, pouring, and shirt hanging. In pushing, DMD achieves an 80% success rate with as few as 8 expert demonstrations, where naive behavior cloning reaches only 20%. In stacking, DMD succeeds on average 92% of the time across 5 cups, versus 40% for BC. When pouring coffee beans, DMD transfers to another cup successfully 80% of the time. Finally, DMD attains a 90% success rate for hanging a shirt on a clothing rack.

Paper Structure (43 sections, 2 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Eye-in-hand Imitation Learning with DMD: A common failure mode in an imitation learning setting is poor generalization due to compounding execution errors at test time, as shown in (a). This can be solved by collecting more expert data to cover these off-trajectory states, as shown in (b); however, this is an expensive process. Our proposed approach is to synthesize data instead of collecting it (c). The magenta arrow represents a small perturbation ($\Delta p$) to the trajectory. The cyan arrow represents the label ($\tilde{a}_t$) for this out-of-distribution observation. We use a state-of-the-art diffusion model to take images $I_t$ from expert demonstrations (d, left) and generate realistic off-trajectory images $\tilde{I}_t$ (d, right). Note the distance between the grabber and the apple, denoted by the green line. This synthetic data augments expert demonstrations for policy learning, leading to more robust policies.
  • Figure 2: DMD System Overview: Our system operates in three stages. a) A diffusion model is trained, using task and play data, to synthesize novel views relative to a given image. b) This diffusion model is used to generate an augmenting dataset that contains off-trajectory views ($\tilde{I}_2^1$, $\tilde{I}_2^2$) from expert demonstrations. Labels for these views (cyan arrows) are constructed such that off-trajectory views will still converge towards task success (right). Images with a green border are from trajectories in the original task dataset. Purple-outlined images are diffusion-generated augmenting samples. c) The original task data and augmenting dataset are combined for policy learning.
  • Figure 3: DMD Architecture: We use the architecture introduced in yu2023long, a U-Net diffusion model with blocks composed of convolution, self-attention, and cross-attention layers. The conditioning image $I_a$ and the noised target image $I_b$ are processed in parallel except at cross-attention layers. The pose conditioning information is injected at cross-attention layers.
  • Figure 4: Training Examples from the Diffusion Model and Computed Labels: We visualize generated examples, $\tilde{I}$, used to train our policies along with the computed action label (the arrow is a projection of the 3D action into the 2D image plane: the arrow pointing up means move the gripper forward, pointing to the right means move it right). The first row shows augmenting samples for the pushing task, while the second row shows those for the stacking task.
  • Figure 5: The Overshooting Problem: When the generated image $\tilde{I}_t$ moves past $I_{t+1}$, the inferred action for $\tilde{I}_t$ may direct the agent away from task success. We refer to this as the overshooting problem. At time step $t+1$, the view $I_{t+1}$ has moved to the lower right of $I_t$. However, the synthesized sample $\tilde{I}_t$ has moved even further to the right than $I_{t+1}$, but not beyond $I_{t+3}$. The blue arrow represents the action label for $\tilde{I}_t$ computed using $I_{t+1}$ as the target; the green arrow represents the action label computed using $I_{t+3}$ as the target. Since $\tilde{I}_t$ has overshot $I_{t+1}$, an action taken with $I_{t+1}$ as the next intended target moves backward, away from the apple, so this labeling is undesirable. Computing the action with respect to a farther image, say $I_{t+3}$, does not have this issue.
  • ...and 6 more figures
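
The overshooting problem described in the Figure 5 caption can be illustrated with a small numeric sketch. This is not the authors' implementation: it assumes camera positions only (rotation is ignored), a made-up straight-line expert trajectory, and a hypothetical helper `corrective_label` that points from the perturbed position toward the expert position at step t+k. It only shows why a label computed against $I_{t+1}$ can point backward while one computed against $I_{t+3}$ still points toward the goal.

```python
import math

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def unit(v):
    n = norm(v)
    if n == 0.0:
        raise ValueError("zero-length direction")
    return tuple(x / n for x in v)

def corrective_label(perturbed_pos, expert_positions, t, k):
    """Hypothetical labeling rule: unit direction from the perturbed
    position to the expert position k steps ahead (clamped to the end)."""
    target = expert_positions[min(t + k, len(expert_positions) - 1)]
    return unit(sub(target, perturbed_pos))

# Assumed expert trajectory: the camera advances along +x toward the goal.
expert = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
          (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
t = 0

# Assumed perturbation: lands past the position of I_{t+1} but before I_{t+3}.
delta_p = (1.5, 0.2, 0.0)
perturbed = add(expert[t], delta_p)

# Direction of overall task progress, used only to check the labels.
progress = unit(sub(expert[-1], expert[t]))

label_k1 = corrective_label(perturbed, expert, t, k=1)  # target: I_{t+1}
label_k3 = corrective_label(perturbed, expert, t, k=3)  # target: I_{t+3}

print("k=1 label:", label_k1, "progress component:", dot(label_k1, progress))
print("k=3 label:", label_k3, "progress component:", dot(label_k3, progress))
```

In this toy setup the k=1 label has a negative component along the direction of progress (it moves backward), while the k=3 label has a positive one, which mirrors the blue and green arrows in the caption.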