C3DM: Constrained-Context Conditional Diffusion Models for Imitation Learning

Vaibhav Saxena, Yotto Koga, Danfei Xu

TL;DR

This paper tackles the fragility of behavior-cloning-based visuomotor policies in the presence of distractors by introducing C3DM, a constrained-context diffusion model that uses a fixation-while-denoising diffusion process (fDDP) to focus on action-relevant input regions. By jointly refining actions and constrained observations around inferred fixation points, C3DM reduces reliance on spurious correlations and achieves high 6-DoF manipulation accuracy from few demonstrations, as few as five in sim-to-real scenarios. Empirically, C3DM outperforms diffusion-policy baselines across 5 simulated tasks and demonstrates robust real-robot and sim-to-real performance with both RGB and depth inputs. The work offers a practical, data-efficient approach to robust visuomotor imitation in cluttered, real-world environments.

Abstract

Behavior Cloning (BC) methods are effective at learning complex manipulation tasks. However, they are prone to spurious correlations: expressive models may focus on distractors that are irrelevant to action prediction, which makes them fragile in real-world deployment. Prior methods have addressed this challenge by exploring different model architectures and action representations. However, none were able to balance sample efficiency and robustness against distractors when solving manipulation tasks with a complex action space. We present the Constrained-Context Conditional Diffusion Model (C3DM), a diffusion model policy for solving 6-DoF robotic manipulation tasks that is robust to distractions and can learn deployable robot policies from as few as five demonstrations. A key component of C3DM is a fixation step that helps the action denoiser focus on task-relevant regions around a predicted fixation point while ignoring distractors in the context. We empirically show that C3DM is robust to out-of-distribution distractors and consistently achieves high success rates on a wide array of tasks, ranging from table-top manipulation to industrial kitting, that require varying levels of precision and robustness to distractors.

Paper Structure

This paper contains 31 sections, 17 equations, 14 figures, 8 tables, and 2 algorithms.

Figures (14)

  • Figure 1: Illustrating 6-DoF action prediction for table-top manipulation using our diffusion model (C3DM), which implicitly learns to fixate on relevant parts of the input and iteratively refines its prediction using action-relevant details about the observation.
  • Figure 2: Constrained-Context Conditional Diffusion Model (C3DM) for visuomotor policy learning. Here we illustrate the iterative action-refinement procedure (4 timesteps) using our fixation-while-denoising diffusion process (fDDP), wherein we constrain our input context around a "fixation point" predicted by the model ($\wedge$) at each refinement step. Subsequently, we refine the predicted action by fixating only on the useful part of the context, hence removing distractions and making use of higher levels of detail in the input.
  • Figure 3: (a) Action inference and generation in Diffusion Policy, compared with (b) inference and generation of observation-action tuples in C3DM. Filled circles represent observed variables. In (b), solid lines represent the generative distributions, $p_\theta( \mathbf{O}_{t-1} | \mathbf{X}_{t} )$ for generating latent (fixated) observations and $p_\theta( \mathbf{a}_{t-1} | \mathbf{X}_{t} )$ for the next latent (noisy) actions. Solid and dashed lines together show the inference (de-noising) distributions, $q( \mathbf{O}_{t-1} | \mathbf{O}_{t}, \mathbf{X}_{0} )$ and $q( \mathbf{a}_{t-1} | \mathbf{X}_{t}, \mathbf{a}_{0} )$, for inferring latent observations and actions, respectively.
  • Figure 4: Illustrating masking (top) and zooming (bottom) for constraining context around predicted fixation point ($\wedge$).
  • Figure 5: (Top) Success rates for manipulation tasks in simulation (average across 100 rollouts, peak performance in 500 epochs of training). (Bottom) Illustration of the simulated evaluation tasks.
  • ...and 9 more figures
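
The two context-constraining operations named in the Figure 4 caption, masking and zooming around a fixation point, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `mask_around` and `zoom_around`, the square window of half-width `radius`, the single-channel nested-list image, and the nearest-neighbour resize are all assumptions chosen to keep the example self-contained.

```python
from typing import List, Tuple

# Hypothetical single-channel image: a list of rows of pixel values.
Image = List[List[float]]


def mask_around(image: Image, fixation: Tuple[int, int], radius: int) -> Image:
    """Keep pixels inside a square window around the fixation; zero the rest.

    The square window (Chebyshev distance <= radius) is an assumption.
    """
    fr, fc = fixation
    return [
        [px if abs(r - fr) <= radius and abs(c - fc) <= radius else 0.0
         for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


def zoom_around(image: Image, fixation: Tuple[int, int], radius: int) -> Image:
    """Crop a window around the fixation and resize it to the original size.

    Nearest-neighbour sampling is used here purely for simplicity.
    """
    h, w = len(image), len(image[0])
    fr, fc = fixation
    side = 2 * radius + 1
    # Clamp the crop so it stays inside the image bounds.
    top = min(max(fr - radius, 0), max(h - side, 0))
    left = min(max(fc - radius, 0), max(w - side, 0))
    crop_h, crop_w = min(side, h), min(side, w)
    return [
        [image[top + (r * crop_h) // h][left + (c * crop_w) // w]
         for c in range(w)]
        for r in range(h)
    ]


if __name__ == "__main__":
    img = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    print(mask_around(img, (1, 1), 1))  # pixels outside the 3x3 window become 0.0
    print(zoom_around(img, (1, 1), 1))  # 3x3 crop stretched back to 4x4
```

In this sketch, masking keeps the original resolution and only hides content outside the window, while zooming discards that content and spends the full resolution on the retained region, which is one way to read the Figure 2 caption's "higher levels of detail in the input".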