Canonical Policy: Learning Canonical 3D Representation for SE(3)-Equivariant Policy

Zhiyuan Zhang, Zhengtong Xu, Jai Nanda Lakamsani, Yu She

TL;DR

Canonical Policy introduces a principled 3D canonicalization framework for SE(3)-equivariant imitation learning from point clouds. By estimating a canonical pose with an $\mathrm{SO}(3)$-equivariant network (Vector Neuron) and mapping observations and actions into a shared canonical frame, the method enables end-to-end learning with generative policy heads (diffusion/flow) and a Point Cloud Aggregation Encoder. Across 12 simulated tasks and 4 real-world tasks, CP-SO2/CP-SO3 consistently outperform state-of-the-art baselines, with average improvements of about $18\%$ in simulation and $39.7\%$ in real-world experiments, and generalize to unseen objects, appearances, viewpoints, and robot platforms. The work also discusses limitations such as computational overhead and sensitivity to large viewpoint shifts, and suggests view-invariant encoders and multi-view representations as avenues for improving scalability and robustness.

Abstract

Visual imitation learning has achieved remarkable progress in robotic manipulation, yet generalization to unseen objects, scene layouts, and camera viewpoints remains a key challenge. Recent advances address this by using 3D point clouds, which provide geometry-aware, appearance-invariant representations, and by incorporating equivariance into policy architectures to exploit spatial symmetries. However, existing equivariant approaches often lack interpretability and rigor due to unstructured integration of equivariant components. We introduce canonical policy, a principled framework for 3D equivariant imitation learning that unifies 3D point cloud observations under a canonical representation. We first establish a theory of 3D canonical representations, enabling equivariant observation-to-action mappings by grouping both seen and novel point clouds to a canonical representation. We then propose a flexible policy learning pipeline that leverages the geometric symmetries of the canonical representation and the expressiveness of modern generative models. We validate canonical policy on 12 diverse simulated tasks and 4 real-world manipulation tasks across 16 configurations, involving variations in object color, shape, camera viewpoint, and robot platform. Compared to state-of-the-art imitation learning policies, canonical policy achieves an average improvement of 18.0% in simulation and 39.7% in real-world experiments, demonstrating superior generalization capability and sample efficiency. For more details, please refer to the project website: https://zhangzhiyuanzhang.github.io/cp-website/.

Paper Structure

This paper contains 28 sections, 2 theorems, 34 equations, 15 figures, 13 tables.

Key Result

Proposition 1

Let $\mathbf{x}^\mathrm{de}, \mathbf{y}^\mathrm{de}$ denote two decentered elements within the same equivariant group, and let $\mathbf{R}_{\mathbf{x}}, \mathbf{R}_{\mathbf{y}} \in \mathrm{SO}(3)$ be the rotation matrices constructed via Schmidt orthogonalization from the respective $\mathrm{SO}(3)$-equivariant network outputs. Then the resulting canonical representations are equal, i.e., ${\mathbf{x}}^\mathrm{cn} = {\mathbf{y}}^\mathrm{cn}$.
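The statement can be checked numerically with a small sketch. It is not the paper's implementation: `equivariant_frame` is a hypothetical, hand-written stand-in for the Vector Neuron network (it builds two rotation-equivariant vectors from norm-weighted sums of the points), and the convention $\mathbf{x}^\mathrm{cn} = \mathbf{R}_{\mathbf{x}}^\top \mathbf{x}^\mathrm{de}$ is an assumption made here for illustration. All function names are illustrative.

```python
import numpy as np


def equivariant_frame(decentered):
    """Hypothetical stand-in for the SO(3)-equivariant network followed by
    Schmidt orthogonalization: returns R whose columns rotate with the input."""
    norms = np.linalg.norm(decentered, axis=1, keepdims=True)
    v1 = (norms * decentered).sum(axis=0)       # rotates together with the input
    v2 = (norms ** 2 * decentered).sum(axis=0)  # second equivariant vector
    u1 = v1 / np.linalg.norm(v1)
    u2 = v2 - (v2 @ u1) * u1                    # Schmidt step: remove the u1 component
    u2 = u2 / np.linalg.norm(u2)
    u3 = np.cross(u1, u2)                       # right-handed third axis, det(R) = +1
    return np.stack([u1, u2, u3], axis=1)


def canonicalize(points):
    """Decenter an (N, 3) cloud and rotate it into its canonical frame."""
    decentered = points - points.mean(axis=0)
    R = equivariant_frame(decentered)
    return decentered @ R                       # row-wise R^T x (assumed convention)


def random_rotation(rng):
    """Random proper rotation from a QR decomposition."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0                         # flip one axis to get det = +1
    return Q


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))                    # one element of an equivariant group
Q, t = random_rotation(rng), rng.normal(size=3)
y = x @ Q.T + t                                 # SE(3)-transformed copy of x
same = np.allclose(canonicalize(x), canonicalize(y))
print(same)
```

Because the frame satisfies $\mathbf{R}_{\mathbf{y}} = \mathbf{Q}\mathbf{R}_{\mathbf{x}}$ whenever $\mathbf{y}^\mathrm{de} = \mathbf{Q}\mathbf{x}^\mathrm{de}$, the rotation cancels and the two canonical clouds should agree up to floating-point error.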

Figures (15)

  • Figure 1: Illustration of two distinct equivariant groups $\mathcal{G}_i$ and $\mathcal{G}_j$. Samples within the same group can be aligned via an $\mathrm{SE}(3)$ transformation, while samples from different groups cannot be made to coincide.
  • Figure 2: Comparison of $\mathrm{SE}(3)$ augmentation and canonicalization: (a) $\mathrm{SE}(3)$ data augmentation; (b) $\mathrm{SE}(3)$ data canonicalization.
  • Figure 3: Vector Neuron framework for estimating $\mathrm{SO}(3)$-equivariant rotation matrix. A graph is built from the decentered point cloud, local features are aggregated into global equivariant features, and Schmidt orthogonalization generates a rotation in $\mathrm{SO}(3)$ or $\mathrm{SO}(2)$ that aligns the input to a canonical frame.
  • Figure 4: Overview of the canonical policy. The input point cloud $\mathcal{G}$ is first centered by subtracting its mean $\mathcal{G}^\mathrm{mn}$, and then processed by an $\mathrm{SO}(3)$-equivariant network $\Phi$ to estimate the object rotation $\mathbf{R}$. This rotation is subsequently used to obtain the canonicalized point cloud $\mathcal{G}^\mathrm{cn}$, where the superscript "cn" denotes the canonical representation. These canonicalized point clouds are then encoded using a point cloud aggregation encoder. At each diffusion timestep $t$, robot proprioception, including end-effector position $\mathbf{s}^{\mathrm{pos}}_t$, orientation $\mathbf{s}^{\mathrm{ori}}_t$, and gripper width $\mathbf{s}^{\mathrm{grip}}_t$, is transformed into the canonical frame via an SE(3) inverse transformation, yielding canonical proprioception. A similar canonicalization is applied to the noisy action $\mathbf{a}_{k,t}$, after which the model predicts canonical actions ${\mathbf{a}}_{k-1,t}^\mathrm{cn}$. The output is finally mapped back to the original observation frame through an SE(3) forward transformation to obtain $\mathbf{a}_{k-1,t}$.
  • Figure 5: Neighborhood aggregation module. Point features are normalized and affine-transformed based on local neighbors. A residual MLP with max pooling is used to extract permutation-invariant point cloud features.
  • ...and 10 more figures
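
The canonicalize, predict, and map-back loop described in the Figure 4 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: `estimate_frame` is a hypothetical stand-in for the equivariant network $\Phi$, `canonical_head` is a placeholder for the learned generative head (the diffusion denoising loop is omitted), and only positional proprioception and actions are handled (orientation and gripper width are left out).

```python
import numpy as np


def estimate_frame(points):
    """Return (centroid, R). Hypothetical stand-in for the equivariant network."""
    mean = points.mean(axis=0)
    d = points - mean
    w = np.linalg.norm(d, axis=1, keepdims=True)
    v1, v2 = (w * d).sum(axis=0), (w ** 2 * d).sum(axis=0)
    u1 = v1 / np.linalg.norm(v1)
    u2 = v2 - (v2 @ u1) * u1                 # Schmidt orthogonalization
    u2 = u2 / np.linalg.norm(u2)
    return mean, np.stack([u1, u2, np.cross(u1, u2)], axis=1)


def canonical_head(points_cn, ee_pos_cn):
    """Placeholder for the learned policy head; deliberately NOT equivariant."""
    return 0.5 * ee_pos_cn + 0.1 * np.tanh(points_cn).mean(axis=0)


def policy(points, ee_pos):
    """Wrap an arbitrary head with canonicalization on the way in and out."""
    mean, R = estimate_frame(points)
    points_cn = (points - mean) @ R          # canonicalized point cloud
    ee_pos_cn = R.T @ (ee_pos - mean)        # inverse SE(3) transform of proprioception
    action_cn = canonical_head(points_cn, ee_pos_cn)
    return R @ action_cn + mean              # forward SE(3) transform of the action


rng = np.random.default_rng(1)
points = rng.normal(size=(128, 3))
ee_pos = rng.normal(size=3)

# Random proper rotation Q and translation t applied to the whole scene.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1.0
t = rng.normal(size=3)

moved_action = policy(points @ Q.T + t, Q @ ee_pos + t)
expected_action = Q @ policy(points, ee_pos) + t
equivariant = np.allclose(moved_action, expected_action)
print(equivariant)
```

The point of the sketch is that the head itself needs no symmetry constraints: since it only ever sees canonical inputs, transforming the scene by an SE(3) element transforms the predicted position by the same element.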

Theorems & Definitions (5)

  • Remark 1
  • Proposition 1
  • proof
  • Proposition 2
  • proof