SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation

Cheng-Chun Hsu, Bowen Wen, Jie Xu, Yashraj Narang, Xiaolong Wang, Yuke Zhu, Joydeep Biswas, Stan Birchfield

TL;DR

SPOT addresses the challenge of learning manipulation from demonstrations by representing task progress as SE(3) object pose trajectories relative to the target. It trains a diffusion-based policy that predicts future object trajectories conditioned on pose history, enabling closed-loop action plans that automatically respect intermediate constraints without hand-crafted rules. The method demonstrates strong performance on RLBench with a single-camera setup and succeeds in real-world tasks using only eight iPhone-recorded demonstrations, illustrating cross-embodiment generalization and data efficiency. Overall, SPOT provides a flexible, language-conditioned, object-centric framework that generalizes across environments and embodiments while handling long-horizon manipulation tasks.

Abstract

We introduce SPOT, an object-centric imitation learning framework. The key idea is to capture each task by an object-centric representation, specifically the SE(3) object pose trajectory relative to the target. This approach decouples embodiment actions from sensory inputs, facilitating learning from various demonstration types, including both action-based and action-less human hand demonstrations, as well as cross-embodiment generalization. Additionally, object pose trajectories inherently capture planning constraints from demonstrations without the need for manually crafted rules. To guide the robot in executing the task, the object trajectory is used to condition a diffusion policy. We systematically evaluate our method in simulation and on real-world tasks. In real-world evaluation, using only eight demonstrations shot on an iPhone, our approach completed all tasks while fully complying with task constraints. Project page: https://nvlabs.github.io/object_centric_diffusion
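To make the central representation concrete, the sketch below computes an object's SE(3) pose relative to a target from two poses expressed in a shared camera frame. It is a minimal illustration using standard rigid-transform algebra, not the authors' implementation; the helper names (`make_pose`, `invert_pose`, `relative_pose`, `rot_z`) and the 4x4 homogeneous-matrix convention are assumptions made for this example.

```python
import numpy as np

def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def invert_pose(T):
    """Invert a rigid transform using the closed form (R^T, -R^T t)."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

def relative_pose(T_cam_target, T_cam_object):
    """Pose of the object expressed in the target's frame (hypothetical helper)."""
    return invert_pose(T_cam_target) @ T_cam_object

def rot_z(theta):
    """Rotation by theta radians about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Illustrative poses of a target and an object as seen from one camera.
T_cam_target = make_pose(rot_z(0.3), [0.5, 0.0, 0.2])
T_cam_object = make_pose(rot_z(1.0), [0.1, 0.4, 0.2])
T_target_object = relative_pose(T_cam_target, T_cam_object)

# Viewing the same scene from a different camera leaves the relative pose unchanged.
T_newcam_cam = make_pose(rot_z(-0.7), [1.0, -2.0, 0.5])
T_target_object_newcam = relative_pose(T_newcam_cam @ T_cam_target,
                                       T_newcam_cam @ T_cam_object)
print(np.allclose(T_target_object, T_target_object_newcam))
```

Because the camera transform cancels in the product, a relative pose of this kind does not depend on the viewpoint, which is consistent with the abstract's claim that the representation decouples the task from the sensing setup and the embodiment.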

Paper Structure

This paper contains 12 sections, 4 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: We present SPOT, an imitation learning method that leverages object pose trajectories as an intermediate representation. Given the observation, our framework estimates the object’s pose relative to the target, predicts its future path in SE(3), and derives an action plan accordingly. Our diffusion model is trained on demonstration trajectories extracted from videos without needing action data from the same embodiment.
  • Figure 2: Overview. During training, we extract object pose trajectories from demonstration RGBD videos (e.g., collected with an iPhone), which are independent of the embodiment. Using these extracted trajectories, we train a diffusion model to generate future object trajectories and determine task completion based on current and past poses. During task execution, the task-relevant object is constantly tracked, and its pose is forwarded to the trajectory diffusion model to predict the object's future trajectory in SE(3) that leads to task accomplishment. Finally, we convert the generated trajectories into embodiment-agnostic action plans for closed-loop manipulation.
  • Figure 3: Real-world Tasks and Qualitative Results. Demonstration data was collected using an iPhone to record the RGBD video of a human performing the tasks (Left). The robot deploys the trained policy in drastically different environments, lighting conditions, camera perspectives, and object configurations from demonstration time (Right).
  • Figure 4: Real-world Quantitative Results. Our method outperforms the point-tracking baseline. We categorize failure modes into (i) tracking failure, (ii) placing failure, and (iii) task constraint failure.
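The Figure 2 caption mentions converting generated object trajectories into action plans. One common way to do this, sketched below, assumes the object is rigidly grasped so that the object-to-gripper transform stays fixed during motion. This assumption, the function name `object_traj_to_ee_traj`, and its arguments are illustrative choices for this example and are not taken from the paper.

```python
import numpy as np

def invert_pose(T):
    """Invert a 4x4 rigid transform using the closed form (R^T, -R^T t)."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

def object_traj_to_ee_traj(T_world_target, T_target_obj_traj,
                           T_world_obj_now, T_world_ee_now):
    """Map object poses (relative to the target) to end-effector poses in the world.

    Assumes a rigid grasp: the transform from object frame to end-effector
    frame measured now is held constant for every future waypoint.
    """
    # Fixed grasp transform: end-effector pose expressed in the object frame.
    T_obj_ee = invert_pose(T_world_obj_now) @ T_world_ee_now
    # Each waypoint: target -> world, then object-in-target, then ee-in-object.
    return [T_world_target @ T_rel @ T_obj_ee for T_rel in T_target_obj_traj]

def translation_pose(t):
    """Helper for the example: identity rotation with translation t."""
    T = np.eye(4)
    T[:3, 3] = t
    return T

# Example: the object starts 1 m from the target and should end at the target.
T_world_target = translation_pose([1.0, 0.0, 0.0])
T_world_obj_now = translation_pose([0.0, 0.0, 0.0])
T_world_ee_now = translation_pose([0.0, 0.0, 0.1])  # gripper 10 cm above object

rel_now = invert_pose(T_world_target) @ T_world_obj_now
rel_goal = np.eye(4)  # object pose coincides with the target pose
plan = object_traj_to_ee_traj(T_world_target, [rel_now, rel_goal],
                              T_world_obj_now, T_world_ee_now)
print(plan[-1][:3, 3])  # final gripper position
```

Since the plan is expressed purely as end-effector poses, any robot that can reach those poses could in principle follow it, which is one reading of "embodiment-agnostic" in the caption.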