SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation
Cheng-Chun Hsu, Bowen Wen, Jie Xu, Yashraj Narang, Xiaolong Wang, Yuke Zhu, Joydeep Biswas, Stan Birchfield
TL;DR
SPOT addresses the challenge of learning manipulation from demonstrations by representing task progress as SE(3) object pose trajectories relative to the target. It trains a diffusion-based policy that predicts future object trajectories conditioned on pose history, enabling closed-loop action plans that automatically respect intermediate constraints without hand-crafted rules. The method demonstrates strong performance on RLBench with a single-camera setup and succeeds in real-world tasks using only eight iPhone-recorded demonstrations, illustrating cross-embodiment generalization and data efficiency. Overall, SPOT provides a flexible, language-conditioned, object-centric framework that generalizes across environments and embodiments while handling long-horizon manipulation tasks.
Abstract
We introduce SPOT, an object-centric imitation learning framework. The key idea is to capture each task by an object-centric representation, specifically the SE(3) object pose trajectory relative to the target. This approach decouples embodiment actions from sensory inputs, facilitating learning from various demonstration types, including both action-based and action-less human hand demonstrations, as well as cross-embodiment generalization. Additionally, object pose trajectories inherently capture planning constraints from demonstrations without the need for manually crafted rules. To guide the robot in executing the task, the object trajectory is used to condition a diffusion policy. We systematically evaluate our method in simulation and on real-world tasks. In real-world evaluation, using only eight demonstrations shot on an iPhone, our approach completed all tasks while fully complying with task constraints. Project page: https://nvlabs.github.io/object_centric_diffusion
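To make the representation concrete, the following is a minimal sketch of how an object pose trajectory can be expressed relative to a target, assuming poses are given as 4x4 homogeneous transforms in a common world (or camera) frame. The helper names (`make_pose`, `invert_pose`, `relative_trajectory`) and the example poses are illustrative assumptions, not part of the SPOT codebase, and the sketch covers only the change of reference frame, not pose estimation or the diffusion policy.

```python
import numpy as np

def make_pose(rotation, translation):
    """Build a 4x4 homogeneous SE(3) transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def invert_pose(T):
    """Invert an SE(3) transform in closed form: (R, t) -> (R^T, -R^T t)."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

def relative_trajectory(object_poses, target_pose):
    """Express each object pose in the target's frame: T_rel = T_target^{-1} @ T_object."""
    target_inv = invert_pose(target_pose)
    return np.stack([target_inv @ T for T in object_poses])

def rot_z(theta):
    """Rotation about the z-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Hypothetical scene: the object descends toward a target rotated 90 degrees about z.
target = make_pose(rot_z(np.pi / 2), [1.0, 0.0, 0.0])
object_poses = [make_pose(np.eye(3), [1.0, 0.0, z]) for z in (0.3, 0.2, 0.1)]
rel = relative_trajectory(object_poses, target)

# Moving the whole scene (or the camera) by the same rigid transform
# leaves the relative trajectory unchanged.
world_shift = make_pose(rot_z(0.7), [0.5, -2.0, 0.1])
rel_shifted = relative_trajectory(
    [world_shift @ T for T in object_poses], world_shift @ target
)
```

Because the relative trajectory is unchanged when the object and target are moved together, the same representation can be obtained from demonstrations recorded in different scenes or from different viewpoints, which is consistent with the decoupling from embodiment and sensory input described above.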
