3D Equivariant Visuomotor Policy Learning via Spherical Projection
Boce Hu, Dian Wang, David Klee, Heng Tian, Xupeng Zhu, Haojie Huang, Robert Platt, Robin Walters
TL;DR
This work develops Image-to-Sphere Policy (ISP), the first $\mathrm{SO}(3)$-equivariant visuomotor policy that learns from monocular eye-in-hand RGB input. It lifts 2D image features onto a sphere and works with an equivariance-corrected spherical representation. ISP combines an $\mathrm{SO}(3)$-equivariant observation encoder with an $\mathrm{SO}(3)$-equivariant diffusion module, enforcing end-to-end equivariance to global $\mathrm{SO}(3)$ rotations and local $\mathrm{SO}(2)$ invariance to camera roll, which improves data efficiency and generalization. The approach achieves state-of-the-art performance on 12 MimicGen simulation tasks and four real-world tasks, with gains of up to 42.5% in real-world settings, while using fewer demonstrations and running inference in real time. These results demonstrate the practical viability of monocular RGB-based, symmetry-aware visuomotor control for robust 3D manipulation in dynamic, real-world environments.
Abstract
Equivariant models have recently been shown to improve the data efficiency of diffusion policy by a significant margin. However, prior work that explored this direction focused primarily on point cloud inputs generated by multiple cameras fixed in the workspace. This type of point cloud input is not compatible with the now-common setting where the primary input modality is an eye-in-hand RGB camera like a GoPro. This paper closes this gap by incorporating into the diffusion policy model a process that projects features from the 2D RGB camera image onto a sphere. This enables us to reason about symmetries in $\mathrm{SO}(3)$ without explicitly reconstructing a point cloud. We perform extensive experiments in both simulation and the real world that demonstrate that our method consistently outperforms strong baselines in terms of both performance and sample efficiency. Our work, Image-to-Sphere Policy ($\textbf{ISP}$), is the first $\mathrm{SO}(3)$-equivariant policy learning framework for robotic manipulation that works using only monocular RGB inputs.
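As geometric intuition for projecting an image onto a sphere, the following minimal sketch back-projects pixel locations to unit directions with a pinhole camera model and then rotates them. It is not the ISP implementation: the function names and the intrinsics (`fx`, `fy`, `cx`, `cy`) are hypothetical, and the sketch only places pixel locations on the sphere, whereas the paper projects learned image features.

```python
import numpy as np

def pixels_to_sphere(height, width, fx, fy, cx, cy):
    """Back-project each pixel to a unit direction in the camera frame.

    Assumes an ideal pinhole camera (no lens distortion); the intrinsics
    are illustrative values, not taken from the paper.
    """
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    rays = np.stack(
        [(u - cx) / fx, (v - cy) / fy, np.ones_like(u, dtype=float)],
        axis=-1,
    )
    # Normalizing each ray places it on the unit sphere S^2.
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

def rotation_about_z(angle):
    """Rotation about the optical (z) axis, i.e. a camera roll in SO(2)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Hypothetical 5x7 image with the principal point at pixel (row 2, col 3).
dirs = pixels_to_sphere(5, 7, fx=5.0, fy=5.0, cx=3.0, cy=2.0)

# On the sphere, a rotation acts by matrix multiplication on each direction,
# so the rotated points remain on the sphere.
R = rotation_about_z(np.pi / 6)
rotated = dirs @ R.T
```

Once the signal lives on the sphere, a rotation is a simple linear action on the sample directions, which is what makes it possible to reason about $\mathrm{SO}(3)$ symmetry without reconstructing a point cloud. In the image plane, by contrast, only the roll about the optical axis acts this simply.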
