Correspondence-Oriented Imitation Learning: Flexible Visuomotor Control with 3D Conditioning

Yunhao Cao, Zubin Bhaumik, Jessie Jia, Xingyi He, Kuan Fang

TL;DR

COIL introduces a flexible 3D correspondence-based framework for visuomotor control, where tasks are specified by the 3D trajectories of keypoints on scene objects and can vary in spatial and temporal granularity. A Spatio-Temporal Transformer fuses 3D observations, tracked keypoints, and the task representation to produce multi-step actions via a flow-matching head, trained with self-supervised hindsight correspondence labeling and augmentation. The approach demonstrates strong zero-shot generalization across rigid and deformable tasks, outperforming baselines and handling varying specification densities, with ablations confirming the importance of attention mechanisms and robust training. This work advances scalable, interpretable robot learning by grounding flexible, 3D task representations directly in perception and action.

Abstract

We introduce Correspondence-Oriented Imitation Learning (COIL), a conditional policy learning framework for visuomotor control with a flexible task representation in 3D. At the core of our approach, each task is defined by the intended motion of keypoints selected on objects in the scene. Instead of assuming a fixed number of keypoints or uniformly spaced time intervals, COIL supports task specifications with variable spatial and temporal granularity, adapting to different user intents and task requirements. To robustly ground this correspondence-oriented task representation into actions, we design a conditional policy with a spatio-temporal attention mechanism that effectively fuses information across multiple input modalities. The policy is trained via a scalable self-supervised pipeline using demonstrations collected in simulation, with correspondence labels automatically generated in hindsight. COIL generalizes across tasks, objects, and motion patterns, achieving superior performance compared to prior methods on real-world manipulation tasks under both sparse and dense specifications.
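
The hindsight labeling idea above can be illustrated with a minimal sketch. It assumes a demonstration already comes with tracked 3D keypoints, and it builds a task specification by subsampling those tracks at a chosen spatial and temporal granularity. The function name `hindsight_task_spec`, the uniform random sampling, and the choice to always keep the final step are illustrative assumptions, not details given in the abstract.

```python
import numpy as np

def hindsight_task_spec(tracks, num_keypoints, num_waypoints, rng):
    """Subsample tracked keypoints into a task specification (hypothetical helper).

    tracks: array of shape (T, K, 3) with the 3D positions of K tracked
            keypoints over T time steps of one demonstration (T >= 2).
    Returns (spec, times):
      spec  has shape (num_waypoints, num_keypoints, 3)
      times has shape (num_waypoints,), normalized to [0, 1]
    """
    T, K, _ = tracks.shape
    # Spatial granularity: keep a random subset of the tracked keypoints.
    kp_idx = np.sort(rng.choice(K, size=num_keypoints, replace=False))
    # Temporal granularity: keep the final step (assumed here to act as the
    # goal) and sample the remaining waypoints at non-uniform intervals.
    inner = rng.choice(T - 1, size=num_waypoints - 1, replace=False)
    t_idx = np.sort(np.append(inner, T - 1))
    spec = tracks[np.ix_(t_idx, kp_idx)]   # (num_waypoints, num_keypoints, 3)
    times = t_idx / (T - 1)                # normalized timestamps
    return spec, times

# Usage: one demonstration, relabeled at a dense and a sparse granularity.
rng = np.random.default_rng(0)
demo_tracks = rng.normal(size=(50, 12, 3))
dense_spec, dense_times = hindsight_task_spec(demo_tracks, 8, 20, rng)
sparse_spec, sparse_times = hindsight_task_spec(demo_tracks, 1, 2, rng)
```

Drawing several specifications of different density from the same demonstration is one plausible way to expose a policy to both sparse and dense conditioning during training.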

Paper Structure

This paper contains 18 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: We introduce COIL, an approach for versatile manipulation conditioned on a correspondence-oriented task representation in 3D. Each task is defined by a set of keypoints annotated on the observed point cloud of scene objects, with task goals and constraints expressed as their intended 3D trajectories. Unlike prior work that assumes a fixed number of keypoints or densely sampled time steps, COIL supports task specifications with variable spatial and temporal granularity, allowing users or planners to adapt the level of detail based on the task's complexity or intent.
  • Figure 2: Overview of COIL Policy. Our policy encodes the task representation, tracked keypoints, and observed point cloud using shared 3D coordinate encoders. Temporal information is injected via normalized positional encodings. A Spatio-Temporal Transformer efficiently fuses these inputs by interleaving spatial and temporal self-attention and applying cross-attention with the visual observations. The resulting representation is combined with proprioception and passed to a flow-matching head to generate multi-step actions. This design enables effective grounding of task specifications of varying spatial and temporal granularities into precise, executable actions.
  • Figure 3: Task Execution. From left to right: selected input specification to the policy for each evaluation task, execution of our policy, and the robot's achieved end-effector trajectory. Our policy adapts flexibly to the objects in the scene, rotating the gripper to avoid hitting the drawer in the Pick-and-Place task and correctly manipulating the side of the scarf in the Folding task. It also follows the specified trajectory accurately in the Sweeping task.
  • Figure 4: Failure Analysis. We categorize real-world failure cases into perception failures and execution failures, and visualize their proportions across all evaluation tasks. The majority of failures stem from inaccuracies in point tracking, particularly under occlusion or clutter. Execution failures most commonly occur during the grasping phase, often when objects are flat or lack distinctive geometry, making it difficult for the policy to localize reliable grasp points.
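
The interleaved spatial and temporal self-attention described in the Figure 2 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's architecture: it uses single-head attention with identity projections, a sinusoidal encoding of normalized timestamps, and residual connections, and it omits the cross-attention with visual observations, proprioception, and the flow-matching head. The names `time_encoding` and `spatio_temporal_block` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.

    x: (..., N, D). Identity query/key/value projections are an assumption
    made to keep the sketch short; a real model would learn them.
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)   # (..., N, N)
    return softmax(scores, axis=-1) @ x                # (..., N, D)

def time_encoding(times, dim):
    """Sinusoidal encoding of timestamps normalized to [0, 1].

    times: (T,), dim must be even. Returns an array of shape (T, dim).
    """
    freqs = 2.0 ** np.arange(dim // 2)                 # (dim / 2,)
    angles = np.pi * times[:, None] * freqs[None, :]   # (T, dim / 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def spatio_temporal_block(tokens):
    """One interleaved block. tokens: (T, K, D) for T steps and K keypoints."""
    # Spatial attention: within each time step, keypoints attend to each other.
    tokens = tokens + self_attention(tokens)
    # Temporal attention: for each keypoint, time steps attend to each other.
    per_keypoint = np.swapaxes(tokens, 0, 1)           # (K, T, D)
    per_keypoint = per_keypoint + self_attention(per_keypoint)
    return np.swapaxes(per_keypoint, 0, 1)             # back to (T, K, D)

# Usage: 5 non-uniformly spaced waypoints, 3 keypoints, 8-dim features.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3, 8))
times = np.array([0.0, 0.1, 0.4, 0.7, 1.0])
tokens = features + time_encoding(times, 8)[:, None, :]
fused = spatio_temporal_block(tokens)                  # shape (5, 3, 8)
```

Because both attention steps operate over whatever axis length they are given, the same block accepts any number of keypoints and waypoints, which is the property that lets a policy of this kind consume specifications of varying granularity.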