3D Equivariant Visuomotor Policy Learning via Spherical Projection

Boce Hu, Dian Wang, David Klee, Heng Tian, Xupeng Zhu, Haojie Huang, Robert Platt, Robin Walters

TL;DR

This work develops Image-to-Sphere Policy (ISP), the first $\mathrm{SO}(3)$-equivariant visuomotor policy that learns from monocular eye-in-hand RGB input. It lifts 2D image features onto a sphere and applies an equivariance correction to the resulting spherical representation. ISP combines an $\mathrm{SO}(3)$-equivariant observation encoder with an $\mathrm{SO}(3)$-equivariant diffusion module, so the policy is end-to-end equivariant to global $\mathrm{SO}(3)$ rotations and invariant to local $\mathrm{SO}(2)$ camera roll, which improves data efficiency and generalization. The approach reports state-of-the-art performance on 12 MimicGen simulation tasks and four real-world tasks, with gains of up to 42.5% in real-world settings, while using fewer demonstrations and running inference in real time. These results demonstrate the practical viability of monocular RGB-based, symmetry-aware visuomotor control for robust 3D manipulation in dynamic, real-world environments.

Abstract

Equivariant models have recently been shown to improve the data efficiency of diffusion policy by a significant margin. However, prior work that explored this direction focused primarily on point cloud inputs generated by multiple cameras fixed in the workspace. This type of point cloud input is not compatible with the now-common setting where the primary input modality is an eye-in-hand RGB camera like a GoPro. This paper closes this gap by incorporating into the diffusion policy model a process that projects features from the 2D RGB camera image onto a sphere. This enables us to reason about symmetries in $\mathrm{SO}(3)$ without explicitly reconstructing a point cloud. We perform extensive experiments in both simulation and the real world that demonstrate that our method consistently outperforms strong baselines in terms of both performance and sample efficiency. Our work, Image-to-Sphere Policy ($\textbf{ISP}$), is the first $\mathrm{SO}(3)$-equivariant policy learning framework for robotic manipulation that works using only monocular RGB inputs.

Paper Structure

This paper contains 28 sections, 2 theorems, 8 equations, 12 figures, 7 tables.

Key Result

Proposition 1

The map $\mathcal{C}\colon (I, R_x) \mapsto R_x$, which assigns each camera image to its corresponding camera pose $R_x\in \mathrm{SO}(3)$, is an equivariance correction. The corrected signal $\Phi_{\text{corr}}(x)=\rho(\mathcal{C}(x))\Phi(x)=\rho(R_x)\Phi(x)$ is in a world-aligned frame. Thus, the m
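
The effect of the correction in Proposition 1 can be checked numerically in a toy setting. The sketch below is illustrative only and is not the paper's implementation: it assumes the feature $\Phi(x)$ is a single $\ell=1$ (vector-valued) feature, for which $\rho(R)$ is the rotation matrix $R$ itself, and the names `rot`, `phi_cam`, and `corrected` are hypothetical. Rotating the whole scene by $g$ leaves the wrist-camera image unchanged while the camera pose becomes $gR_x$, so the corrected signal should rotate by $g$.

```python
import numpy as np

def rot(axis, angle):
    """Rotation matrix about a given axis via Rodrigues' formula."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def corrected(R_pose, phi):
    # Phi_corr(x) = rho(R_x) Phi(x); for an l=1 feature (assumption) rho(R) = R.
    return R_pose @ phi

# Hypothetical camera-frame feature and camera pose.
phi_cam = np.array([0.2, -0.5, 0.8])
R_x = rot([1, 2, 3], 0.7)

# Global rotation g of the whole scene: the image (and phi_cam) is unchanged,
# while the camera pose becomes g @ R_x.
g = rot([0, 0, 1], 1.1)
before = corrected(R_x, phi_cam)
after = corrected(g @ R_x, phi_cam)

# Corrected features rotate with the scene; raw camera-frame features do not.
equivariance_gap = np.linalg.norm(after - g @ before)
uncorrected_gap = np.linalg.norm(phi_cam - g @ phi_cam)
print(equivariance_gap, uncorrected_gap)
```

In this toy model the first gap vanishes up to floating-point error, while the second does not, which is the failure mode the correction is meant to remove.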

Figures (12)

  • Figure 1: We propose the first $\mathrm{SO}(3)$-equivariant policy learning framework based on a single eye-in-hand RGB image, where the predicted action sequence transforms equivariantly under the same group action $g \in \mathrm{SO}(3)$ applied to the whole scene.
  • Figure 2: Overview of Image-to-Sphere Policy (ISP). (a) An $\mathrm{SO}(3)$-equivariant observation encoder extracts features from the RGB input, projects them onto the sphere, and applies an equivariance correction using the gripper orientation $R_x$ to account for the camera's dynamic viewpoint (red arrow). The corrected spherical signal $\Phi_{\text{corr}}(x)$ is then processed by spherical convolution layers to extract $\mathrm{SO}(3)$ signals. Proprioceptive inputs are embedded via equivariant linear layers. Both image and proprioceptive features are represented as a set of Fourier coefficients $c_{\ell}$ on $\mathrm{SO}(3)$ and fused (yellow block). (b) The encoded spherical signals are transformed back to the spatial domain via inverse Fourier transform, sampling finite group elements as the conditioning vector for $\mathrm{SO}(3)$-equivariant denoising. The noisy action sequence is processed in the same way, through equivariant linear layers and projected onto the same group elements.
  • Figure 3: Illustration of Equivariance Correction. The left side shows two identical scenes under different global transformations. Since the wrist-mounted camera captures images in its local frame, the resulting images, and thus the projected spherical signals, remain identical across both scenes. By applying the gripper orientation $R$ as an equivariance correction, we align these spherical signals to a common world frame, ensuring their equivariant transformation under global scene rotations.
  • Figure 4: Illustration of translation invariance and rotation equivariance-to-invariance transition.
  • Figure 5: A subset of experimental environments from MimicGen. Left: external view of the task. Right: eye-in-hand observation used in the experiments. The full set of tasks is shown in the appendix.
  • ...and 7 more figures
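
The "project onto the sphere" step named in the Figure 2 caption can be pictured with a small sketch. This is one plausible construction, not the paper's: it assumes a pinhole camera with made-up intrinsics (`fx`, `fy`, `cx`, `cy`) and simply attaches each cell of a feature map to the unit ray through its pixel centre. The function names `pixel_rays` and `to_spherical` are hypothetical, and the projection actually used by ISP may differ.

```python
import numpy as np

def pixel_rays(height, width, fx, fy, cx, cy):
    """Unit ray directions (camera frame, z forward) through each pixel centre."""
    v, u = np.meshgrid(np.arange(height) + 0.5,
                       np.arange(width) + 0.5,
                       indexing="ij")
    d = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)
    return d / np.linalg.norm(d, axis=-1, keepdims=True)

def to_spherical(rays):
    """Polar angle from the optical axis and azimuth around it, in radians."""
    polar = np.arccos(np.clip(rays[..., 2], -1.0, 1.0))
    azimuth = np.arctan2(rays[..., 1], rays[..., 0])
    return polar, azimuth

# Hypothetical 8x8 feature map with 16 channels and assumed intrinsics.
H, W, C = 8, 8, 16
features = np.random.default_rng(0).normal(size=(H, W, C))
rays = pixel_rays(H, W, fx=6.0, fy=6.0, cx=W / 2, cy=H / 2)
polar, azimuth = to_spherical(rays)

# features[i, j] is now a sample of a spherical signal at direction rays[i, j];
# the samples cover only the part of the sphere inside the camera's field of view.
print(rays.shape, float(polar.max()))
```

Because the samples live in the camera frame, a signal built this way would still need the pose-based correction described in the caption before it can be compared across viewpoints.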

Theorems & Definitions (6)

  • Definition 1: Equivariance Correction
  • Proposition 1: Equivariance Correction via End-Effector Pose
  • proof
  • Proposition 2: Invariance to $\mathrm{SO}(2)$ Rotation of the Eye-in-hand Camera
  • proof
  • proof