Eq.Bot: Enhance Robotic Manipulation Learning via Group Equivariant Canonicalization
Jian Deng, Yuandong Wang, Yangfu Zhu, Tao Feng, Tianyu Wo, Zhenzhou Shao
TL;DR
Eq.Bot addresses the lack of geometric guarantees in multimodal robotic manipulation by introducing a universal, model-agnostic canonicalization framework grounded in SE(2) equivariance. It canonicalizes observations, applies a base policy in canonical space, and inverts the transformation to deliver spatially consistent actions, enabling plug-in upgrades to existing policies. The approach supports multiple canonicalization networks (including a G-CNN-based option) and is proven to be equivariant; extensive experiments show substantial gains for CNN-based and Transformer-based methods across Ravens, LIBERO, and real-world UR5e scenarios, with notable improvements in unseen spatial configurations. Because it wraps existing policies rather than redesigning them, the framework is portable and improves robustness and generalization in robotic manipulation without architectural changes.
Abstract
Robotic manipulation systems are increasingly deployed across diverse domains. Yet existing multimodal learning frameworks lack inherent guarantees of geometric consistency and struggle to handle spatial transformations such as rotations and translations. While recent works attempt to introduce equivariance through bespoke architectural modifications, these methods suffer from high implementation complexity, high computational cost, and poor portability. Inspired by human cognitive processes in spatial reasoning, we propose Eq.Bot, a universal canonicalization framework for robotic manipulation learning grounded in SE(2) group equivariance theory. Our framework transforms observations into a canonical space, applies an existing policy, and maps the resulting actions back to the original space. As a model-agnostic solution, Eq.Bot aims to endow models with spatial equivariance without requiring architectural modifications. Extensive experiments on various robotic manipulation tasks demonstrate the superiority of Eq.Bot over existing methods under both CNN-based (e.g., CLIPort) and Transformer-based (e.g., OpenVLA-OFT) architectures, with improvements of up to 50.0%.
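The canonicalize, act, and invert pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: observations are assumed to be ordered 2D keypoints rather than images, the learned canonicalization network is replaced by a hand-written geometric rule (centroid plus the direction to the first keypoint), and all function names (`canonicalizer`, `base_policy`, `equivariant_policy`) are hypothetical. The point is only that wrapping an arbitrary, non-equivariant policy this way yields SE(2)-equivariant actions.

```python
import numpy as np

def rot(theta):
    """2x2 rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def canonicalizer(points):
    """Hypothetical stand-in for a learned canonicalization network.

    Returns an SE(2) frame (theta, t) that moves with the observation:
    t is the centroid, theta is the direction from the centroid to the
    first keypoint (assumed not to coincide with the centroid).
    """
    t = points.mean(axis=0)
    d = points[0] - t
    return np.arctan2(d[1], d[0]), t

def base_policy(points):
    """Arbitrary non-equivariant policy: returns a position (x, y) and a yaw."""
    xy = np.tanh(points).mean(axis=0)
    yaw = 0.3 * np.sin(points[:, 1].sum())
    return xy, yaw

def equivariant_policy(points):
    """Canonicalize the observation, act in canonical space, map the action back."""
    theta, t = canonicalizer(points)
    R = rot(theta)
    canonical = (points - t) @ R          # row form of R^T (p - t)
    xy_c, yaw_c = base_policy(canonical)  # existing policy, left unchanged
    return R @ xy_c + t, yaw_c + theta    # inverse map to the original space

def transform(points, phi, shift):
    """Apply the SE(2) element (phi, shift) to every keypoint."""
    return points @ rot(phi).T + shift

rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 2))
phi, shift = 0.7, np.array([0.4, -1.2])

xy, yaw = equivariant_policy(pts)
xy_g, yaw_g = equivariant_policy(transform(pts, phi, shift))

# Equivariance: transforming the scene transforms the action the same way.
print(np.allclose(xy_g, rot(phi) @ xy + shift))
print(np.isclose(np.angle(np.exp(1j * (yaw_g - yaw - phi))), 0.0))
```

In this sketch the base policy is never modified; equivariance comes entirely from the frame estimated by the canonicalizer, which is why the same wrapper could in principle be placed around different policy architectures.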
