C3DM: Constrained-Context Conditional Diffusion Models for Imitation Learning
Vaibhav Saxena, Yotto Koga, Danfei Xu
TL;DR
This paper tackles the fragility of behavior-cloning visuomotor policies in the presence of distractors by introducing C3DM, a constrained-context diffusion model that uses a fixation-driven iterative refinement process (fDDP) to focus on action-relevant input regions. By jointly refining actions and constrained observations around inferred fixation points, C3DM reduces reliance on spurious correlations and achieves accurate 6-DoF manipulation from few demonstrations, as few as five in sim-to-real scenarios. Empirically, C3DM outperforms diffusion-policy baselines across five simulated tasks and shows robust real-robot and sim-to-real performance with both RGB and depth inputs. The work offers a practical, data-efficient approach to robust visuomotor imitation, aimed at reliable deployment in cluttered real-world environments.
Abstract
Behavior Cloning (BC) methods are effective at learning complex manipulation tasks. However, they are prone to spurious correlation: expressive models may focus on distractors that are irrelevant to action prediction, which makes them fragile in real-world deployment. Prior methods have addressed this challenge by exploring different model architectures and action representations, but none has balanced sample efficiency with robustness to distractors when solving manipulation tasks with a complex action space. We present the Constrained-Context Conditional Diffusion Model (C3DM), a diffusion model policy for solving 6-DoF robotic manipulation tasks that is robust to distractions and can learn deployable robot policies from as few as five demonstrations. A key component of C3DM is a fixation step that helps the action denoiser focus on task-relevant regions around a predicted fixation point while ignoring distractors in the context. We empirically show that C3DM is robust to out-of-distribution distractors and consistently achieves high success rates on a wide array of tasks, ranging from table-top manipulation to industrial kitting, that require varying levels of precision and robustness to distractors.
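To make the idea of constraining the context around a fixation point concrete, the following is a minimal sketch and not the authors' implementation. It assumes the constraint can be approximated by a fixed-size square crop centered on the predicted fixation point, with zero padding at the image border. The function name `constrain_context` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def constrain_context(image, fixation, size):
    """Crop a size x size window of `image` centered on `fixation` (row, col).

    Assumption: the constrained context is modeled as a simple crop.
    Pixels outside the image bounds are zero-padded, so the output shape
    is always (size, size, channels).
    """
    h, w = image.shape[:2]
    half = size // 2
    r, c = int(round(fixation[0])), int(round(fixation[1]))
    top, left = r - half, c - half
    out = np.zeros((size, size) + image.shape[2:], dtype=image.dtype)
    # Intersection of the requested window with the image bounds.
    r0, r1 = max(top, 0), min(top + size, h)
    c0, c1 = max(left, 0), min(left + size, w)
    if r0 < r1 and c0 < c1:
        out[r0 - top:r1 - top, c0 - left:c1 - left] = image[r0:r1, c0:c1]
    return out
```

In this sketch, anything outside the window, such as a distractor elsewhere in the scene, never reaches the downstream denoiser. That is the intuition behind ignoring distractors in the context.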
