Eq.Bot: Enhance Robotic Manipulation Learning via Group Equivariant Canonicalization

Jian Deng, Yuandong Wang, Yangfu Zhu, Tao Feng, Tianyu Wo, Zhenzhou Shao

TL;DR

Eq.Bot addresses the lack of geometric guarantees in multimodal robotic manipulation by introducing a universal, model-agnostic canonicalization framework grounded in SE(2) equivariance. It canonicalizes observations, applies a base policy in canonical space, and inverts the transformation to deliver spatially consistent actions, so existing policies can be upgraded as a plug-in. The approach supports multiple canonicalization networks (including a G-CNN-based option) and is proven to be equivariant. Extensive experiments show substantial gains for CNN-based and Transformer-based methods across Ravens, LIBERO, and real-world UR5e scenarios, with notable improvements in unseen spatial configurations. Because it wraps existing policies rather than redesigning their architectures, the framework is portable and improves robustness and generalization in robotic manipulation.

Abstract

Robotic manipulation systems are increasingly deployed across diverse domains. Yet existing multimodal learning frameworks lack inherent guarantees of geometric consistency and struggle to handle spatial transformations such as rotations and translations. While recent works attempt to introduce equivariance through bespoke architectural modifications, these methods suffer from high implementation complexity, high computational cost, and poor portability. Inspired by human cognitive processes in spatial reasoning, we propose Eq.Bot, a universal canonicalization framework grounded in SE(2) group equivariant theory for robotic manipulation learning. Our framework transforms observations into a canonical space, applies an existing policy, and maps the resulting actions back to the original space. As a model-agnostic solution, Eq.Bot aims to endow models with spatial equivariance without requiring architectural modifications. Extensive experiments demonstrate the superiority of Eq.Bot over existing methods under both CNN-based (e.g., CLIPort) and Transformer-based (e.g., OpenVLA-OFT) architectures on various robotic manipulation tasks, with improvements of up to 50.0%.
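
The canonicalize, act, and map-back loop described above can be illustrated with a toy sketch. This is not the authors' implementation: it only covers 90-degree rotations (a discrete subgroup of the rotations in SE(2)) and ignores translation. The names `canonical_rotation`, `base_policy`, and `canonicalized_policy` are hypothetical, and the hand-written canonicalizer stands in for the learned canonicalization network.

```python
import numpy as np

def rotate_pixel(pixel, k, size):
    """Map a (row, col) index through k counter-clockwise quarter turns,
    matching what np.rot90(image, k) does to a size x size image."""
    r, c = pixel
    for _ in range(k % 4):
        r, c = size - 1 - c, r
    return (r, c)

def canonical_rotation(obs):
    """Hypothetical stand-in for a learned canonicalization network:
    pick the quarter-turn count that moves the most mass into the
    top-left quadrant (assumes a unique best rotation)."""
    h = obs.shape[0] // 2
    scores = [np.rot90(obs, k)[:h, :h].sum() for k in range(4)]
    return int(np.argmax(scores))

def base_policy(obs):
    """Hypothetical non-equivariant base policy: it only searches the
    top-left quadrant for the brightest pixel."""
    h = obs.shape[0] // 2
    r, c = np.unravel_index(np.argmax(obs[:h, :h]), (h, h))
    return (int(r), int(c))

def canonicalized_policy(obs):
    """Wrap the unmodified base policy with canonicalization."""
    k = canonical_rotation(obs)              # 1. estimate the canonical pose
    canonical_obs = np.rot90(obs, k)         # 2. rotate the observation into it
    action = base_policy(canonical_obs)      # 3. act in canonical space
    return rotate_pixel(action, -k, obs.shape[0])  # 4. undo the rotation

# An object placed where the base policy never looks (bottom-right quadrant).
obs = np.zeros((8, 8))
obs[6, 5] = 1.0
print("base policy:         ", base_policy(obs))
print("canonicalized policy:", canonicalized_policy(obs))
```

In this toy setting the base policy misses the object, while the wrapped policy recovers its location for every quarter-turn rotation of the scene, without any change to `base_policy` itself.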

Paper Structure

This paper contains 23 sections, 32 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Performance comparison on robotic manipulation benchmarks. Our model-agnostic framework significantly boosts the performance of both the CNN-based CLIPort (left) and the Transformer-based OpenVLA-OFT (right). On the pack-unseen-box task (100 demonstrations), Eq.Bot improves CLIPort's success rate from 62.4% to 93.6%. For OpenVLA-OFT, our method consistently enhances performance on the LIBERO benchmarks.
  • Figure 2: Overview of the proposed Eq.Bot framework, a model-agnostic solution that adds spatial equivariance to existing manipulation systems. Grounded in equivariant theory, Eq.Bot introduces a canonicalization process that transforms input observations into a standardized canonical orientation. The canonicalized observations are then fed into an unmodified base policy (e.g., CLIPort or OpenVLA) to generate actions. Finally, the resulting actions are mapped back to the original space through the inverse transformation for execution.
  • Figure 3: Real-world Tasks. Three robotic tabletop tasks are used for evaluation: (1) pack objects, (2) stack blocks, and (3) place toy. Each sub-figure displays the robot's action (left) and the corresponding objects (right).
  • Figure 4: Real-world results. We compare our Eq.Bot variants with the original CLIPort baseline, showing performance on both seen (in-distribution) and unseen (out-of-distribution) tasks.
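
The equivariance guarantee behind the pipeline in Figure 2 can be stated compactly. The derivation below is the standard canonicalization argument, written in our own notation, which may differ from the paper's symbols. It assumes that the group G = SE(2) acts on both observations o and actions, and that the canonicalization function h is itself equivariant.

```latex
% Illustrative notation (not taken from the paper):
%   \pi        unmodified base policy
%   h          canonicalization function, h : \mathcal{O} \to G
%   \tilde\pi  canonicalized (wrapped) policy
\begin{align}
  h(g \cdot o) &= g \, h(o)
    && \text{(assumption: } h \text{ is equivariant)} \\
  \tilde{\pi}(o) &= h(o) \cdot \pi\!\left(h(o)^{-1} \cdot o\right)
    && \text{(canonicalize, act, map back)} \\
  \tilde{\pi}(g \cdot o)
    &= h(g \cdot o) \cdot \pi\!\left(h(g \cdot o)^{-1} \cdot (g \cdot o)\right) \\
    &= g \, h(o) \cdot \pi\!\left(h(o)^{-1} g^{-1} g \cdot o\right) \\
    &= g \cdot \tilde{\pi}(o)
\end{align}
```

The base policy only ever sees the canonical observation, so it needs no equivariance of its own. The whole guarantee rests on the first line holding for the canonicalization network.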