Correspondence-Oriented Imitation Learning: Flexible Visuomotor Control with 3D Conditioning
Yunhao Cao, Zubin Bhaumik, Jessie Jia, Xingyi He, Kuan Fang
TL;DR
COIL is a 3D correspondence-based framework for visuomotor control in which each task is specified by the 3D trajectories of keypoints on scene objects, at varying spatial and temporal granularity. A Spatio-Temporal Transformer fuses 3D observations, tracked keypoints, and the task representation, and a flow-matching head produces multi-step actions; the policy is trained with self-supervised hindsight correspondence labeling and augmentation. The approach shows strong zero-shot generalization across rigid and deformable tasks, outperforms baselines, and handles varying specification densities, and ablations confirm the importance of the attention mechanisms and robust training. By grounding flexible 3D task representations directly in perception and action, this work moves toward scalable, interpretable robot learning.
Abstract
We introduce Correspondence-Oriented Imitation Learning (COIL), a conditional policy learning framework for visuomotor control with a flexible task representation in 3D. At the core of our approach, each task is defined by the intended motion of keypoints selected on objects in the scene. Instead of assuming a fixed number of keypoints or uniformly spaced time intervals, COIL supports task specifications with variable spatial and temporal granularity, adapting to different user intents and task requirements. To robustly ground this correspondence-oriented task representation into actions, we design a conditional policy with a spatio-temporal attention mechanism that effectively fuses information across multiple input modalities. The policy is trained via a scalable self-supervised pipeline using demonstrations collected in simulation, with correspondence labels automatically generated in hindsight. COIL generalizes across tasks, objects, and motion patterns, achieving superior performance compared to prior methods on real-world manipulation tasks under both sparse and dense specifications.
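To make the task representation concrete, the following is a minimal Python sketch of a correspondence-oriented task specification with variable spatial granularity (number of keypoints) and temporal granularity (number of possibly non-uniform time steps), plus one plausible way to generate such specifications in hindsight from a demonstration. The names CorrespondenceTask and hindsight_label, the random subsampling scheme, and the choice to always keep the final frame are illustrative assumptions, not the authors' implementation.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class CorrespondenceTask:
    """Hypothetical container for a task specification.

    waypoints[k][t] is the intended 3D position of keypoint k at timestamps[t].
    Neither the number of keypoints nor the spacing of timestamps is fixed.
    """
    timestamps: List[float]
    waypoints: List[List[Point3D]]

    @property
    def num_keypoints(self) -> int:
        return len(self.waypoints)

    @property
    def num_steps(self) -> int:
        return len(self.timestamps)


def hindsight_label(
    tracks: List[List[Point3D]],
    times: List[float],
    num_keypoints: int,
    num_steps: int,
    rng: random.Random,
) -> CorrespondenceTask:
    """Build a task specification from keypoint tracks observed in a demonstration.

    tracks[k][t] is the tracked 3D position of keypoint k at times[t].
    Assumed scheme: pick a random subset of keypoints and a random subset of
    frames, always keeping the last frame so the final outcome is specified.
    """
    if not 1 <= num_keypoints <= len(tracks):
        raise ValueError("num_keypoints must be between 1 and the number of tracks")
    if not 1 <= num_steps <= len(times):
        raise ValueError("num_steps must be between 1 and the number of frames")

    last = len(times) - 1
    keypoint_ids = sorted(rng.sample(range(len(tracks)), num_keypoints))
    # Sample num_steps - 1 earlier frames, then append the final frame.
    frame_ids = sorted(rng.sample(range(last), num_steps - 1)) + [last]

    return CorrespondenceTask(
        timestamps=[times[t] for t in frame_ids],
        waypoints=[[tracks[k][t] for t in frame_ids] for k in keypoint_ids],
    )


if __name__ == "__main__":
    # Synthetic demonstration: 5 tracked keypoints over 10 frames.
    demo_times = [0.1 * t for t in range(10)]
    demo_tracks = [[(float(k), 0.05 * t, 0.0) for t in range(10)] for k in range(5)]
    rng = random.Random(0)

    sparse = hindsight_label(demo_tracks, demo_times, num_keypoints=1, num_steps=1, rng=rng)
    dense = hindsight_label(demo_tracks, demo_times, num_keypoints=5, num_steps=10, rng=rng)
    print(sparse.num_keypoints, sparse.num_steps)
    print(dense.num_keypoints, dense.num_steps)
```

Under these assumptions, a sparse specification (one keypoint, final position only) and a dense one (every keypoint at every frame) come from the same demonstration, which is one way a single policy could be exposed to both extremes during training.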
