Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies

Haojie Huang; Karl Schmeckpeper; Dian Wang; Ondrej Biza; Yaoyao Qian; Haotian Liu; Mingxi Jia; Robert Platt; Robin Walters

TL;DR

Imagination Policy tackles high-precision robotic manipulation by replacing direct action inference with a generative approach: a conditional point-flow model imagines target configurations from two input point clouds. Rigid motions are then recovered via point-cloud registration, yielding $SE(3)$ actions, and a bi-equivariant design leverages task symmetries to improve sample efficiency and generalization. The method achieves state-of-the-art performance on challenging RLBench tasks and is validated on a real UR5 robot; ablations and extensions to longer-horizon and articulated-object scenarios are also provided. Limitations include reliance on segmented point clouds and the slow speed of diffusion-based inference, which point to faster inference and broader object categories as future work.

Abstract

Humans can imagine goal states during planning and perform actions to match those goals. In this work, we propose Imagination Policy, a novel multi-task key-frame policy network for solving high-precision pick and place tasks. Instead of learning actions directly, Imagination Policy generates point clouds to imagine desired states, which are then translated to actions using rigid action estimation. This transforms action inference into a local generative task. We leverage the pick and place symmetries underlying the tasks in the generation process and achieve extremely high sample efficiency and generalizability to unseen configurations. Finally, we demonstrate state-of-the-art performance across various tasks on the RLBench benchmark compared with several strong baselines and validate our approach on a real robot.
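
The "rigid action estimation" step can be illustrated with a standard registration technique. The sketch below uses the Kabsch (SVD) algorithm to fit a rotation and translation between an observed point cloud and its imagined counterpart, assuming that row `k` of one array corresponds to row `k` of the other. The function name `estimate_rigid_transform` is hypothetical, and this summary does not confirm that the paper uses this exact estimator.

```python
import numpy as np


def estimate_rigid_transform(source, target):
    """Least-squares rigid fit (R, t) such that target ~= source @ R.T + t.

    Assumes source and target are (N, 3) arrays with known row-wise
    correspondence (an assumption made for this illustration).
    """
    src_centroid = source.mean(axis=0)
    tgt_centroid = target.mean(axis=0)

    # 3x3 cross-covariance of the centered clouds.
    H = (source - src_centroid).T @ (target - tgt_centroid)
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_centroid - R @ src_centroid
    return R, t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    observed = rng.normal(size=(100, 3))

    # Hypothetical ground-truth motion: rotate about z, then translate.
    c, s = np.cos(0.5), np.sin(0.5)
    R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    t_true = np.array([0.2, 0.0, -0.1])
    imagined = observed @ R_true.T + t_true

    R_est, t_est = estimate_rigid_transform(observed, imagined)
    print("rotation error:", np.abs(R_est - R_true).max())
    print("translation error:", np.abs(t_est - t_true).max())
```

With noise-free correspondences the fit recovers the motion up to floating-point error; with generated points that only approximately match the observation, it returns the least-squares rigid motion.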

Paper Structure (16 sections, 2 theorems, 6 equations, 10 figures, 6 tables)

Introduction
Related Work
Method
Pair Generation for Place
Single Generation for Pick
Experiments
3D Key-frame Pick and Place
Real Robot Experiment
Conclusion
Appendix
Ablation Study
Task with Longer Horizon
Task with Articulated Object
Baseline Details
Real-robot Experiments Pipeline
...and 1 more section

Key Result

Proposition 1

Assuming rotation-invariant Gaussian noise $X_0$, if the encoded point features $F_a$ and $F_b$ are invariant to rotations, then $f_{\mathrm{place}}$ is bi-equivariant for all pairs of rotations $(g_a,g_b)\in \mathrm{SO}(3) \times \mathrm{SO}(3)$.
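
One way to read the proposition is sketched below. The conventions here are assumptions and are not stated in this summary: $P_a$ and $P_b$ are the observed clouds, $T_a$ and $T_b$ are the estimated rigid transformations mapping each observed cloud onto its generated counterpart, and the place action is taken to be $T_b^{-1} T_a$. If invariant features and rotation-invariant noise leave the generated pair unchanged when the inputs are rotated by $(g_a, g_b)$, then the estimated transformations absorb the rotations and the action transforms on both sides:

```latex
\begin{aligned}
T_a' &= T_a\, g_a^{-1}, \qquad T_b' = T_b\, g_b^{-1}, \\
f_{\mathrm{place}}(g_a \cdot P_a,\; g_b \cdot P_b)
  &= (T_b')^{-1}\, T_a'
   = \left(T_b\, g_b^{-1}\right)^{-1} T_a\, g_a^{-1} \\
  &= g_b \left(T_b^{-1} T_a\right) g_a^{-1}
   = g_b \; f_{\mathrm{place}}(P_a, P_b) \; g_a^{-1}.
\end{aligned}
```

Here $T_a'$ and $T_b'$ denote the transformations estimated from the rotated inputs; the first line follows from requiring $T_a' (g_a \cdot P_a) = T_a P_a$ and likewise for $b$.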

Figures (10)

Figure 1: Illustration of pick generation and place generation. The pick generator generates the points of the object to be picked conditioned on the gripper point cloud. The place generator generates two new objects repositioned together. The generated points are colored in orange.
Figure 2: Architecture of Imagination Policy. (a) Encoding the observed point features as $F_a$ and $F_b$. (b) Conditional pair generation of the place scene from random Gaussian noise. $x_t^k$ denotes the $k$-th noisy point at time step $t$, with point feature $f^{k}$; $f_{\ell}$ is the language feature. (c) Estimating the rigid transformations ($T_a$ and $T_b$) from the observed point clouds to the generated ones using correspondence.
Figure 3: Trajectory of the pick generation process ("grasp the banana by the crown"). Unlike the place generation, our pick generation is conditioned on the canonicalized gripper point cloud. The generated point cloud at each timestep is colored in orange.
Figure 4: Illustration of the keyframe pipeline of Imagination Policy on Insert-Knife: (a) the RGB-D image captured by the front camera and the segmented point clouds, (b) pick generation, (c) preplace generation, and (d) place generation. The top row shows the generated points in orange and the bottom row shows the configurations of pick, preplace, and place with the calculated rigid transformations.
Figure 5: 3D pick-place tasks from RLBench (James et al., 2020). From left to right the tasks are: Phone-on-Base, Stack-Wine, Put-Plate, Put-Roll, Plug-Charger, and Insert-Knife. The top row shows the initial scene and the bottom row shows the completion state.
...and 5 more figures
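
Figures 2(b) and 3 describe generation as a trajectory that starts from Gaussian noise and moves toward the imagined point cloud over time steps. The toy sketch below shows only the sampling loop, using Euler integration of a velocity field. Everything in it is an assumption made for illustration: `oracle_velocity` is a hypothetical stand-in that already knows the target, whereas the paper's model is a learned network conditioned on point and language features, and the time convention (noise at $t=0$, generated cloud at $t=1$) is chosen for the example.

```python
import numpy as np


def oracle_velocity(x_t, t, target):
    """Hypothetical stand-in for a learned velocity network.

    Returns the straight-line velocity that carries x_t to `target`
    by time t = 1. A real model would not have access to `target`.
    """
    return (target - x_t) / (1.0 - t)


def generate(target, num_steps=10, seed=0):
    """Euler-integrate the flow from Gaussian noise toward `target`.

    Returns the list of intermediate point clouds, one per time step,
    starting with the initial noise.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=target.shape)  # x_0: one Gaussian sample per point
    trajectory = [x.copy()]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # stays strictly below 1, so the division is safe
        x = x + dt * oracle_velocity(x, t, target)
        trajectory.append(x.copy())
    return trajectory


if __name__ == "__main__":
    # Hypothetical target: 64 points on a unit circle in the z = 0 plane.
    angles = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
    target = np.stack([np.cos(angles), np.sin(angles), np.zeros(64)], axis=1)
    traj = generate(target, num_steps=10)
    print("mean distance to target at start:",
          np.linalg.norm(traj[0] - target, axis=1).mean())
    print("mean distance to target at end:",
          np.linalg.norm(traj[-1] - target, axis=1).mean())
```

Because the stand-in velocity is a straight line toward the target, the final Euler step lands on the target; a learned velocity field would only approximate this, and the number of steps trades accuracy against inference time.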

Theorems & Definitions (3)

Proposition 1
proof
Proposition 2
