EquAct: An SE(3)-Equivariant Multi-Task Transformer for Open-Loop Robotic Manipulation
Xupeng Zhu, Yu Qi, Yizhe Zhu, Robin Walters, Robert Platt
TL;DR
EquAct addresses geometric generalization in language-conditioned 3D manipulation by enforcing continuous SE(3) equivariance in the policy while keeping language conditioning invariant to scene transformations. It combines an SE(3)-equivariant Point Transformer U-net (EPTU) built on spherical Fourier features, an SE(3)-invariant iFiLM layer that fuses language, and equivariant field networks that evaluate translational, rotational, and gripper actions within one unified model in a single forward pass. The approach is backed by proofs of equivariance and invariance and is evaluated on 18 RLBench tasks with SE(2) and SE(3) initializations plus 4 real-world tasks, where it achieves state-of-the-art performance and strong 3D generalization. The results indicate that building geometric structure directly into the network architecture, together with language grounding that respects SE(3) symmetries, yields faster, more reliable generalization to unseen 3D configurations and perturbations, which matters for scalable language-conditioned robotic manipulation.
Abstract
Transformer architectures can effectively learn language-conditioned, multi-task 3D open-loop manipulation policies from demonstrations by jointly processing natural language instructions and 3D observations. However, although both the robot policy and language instructions inherently encode rich 3D geometric structures, standard transformers lack built-in guarantees of geometric consistency, often resulting in unpredictable behavior under SE(3) transformations of the scene. In this paper, we leverage SE(3) equivariance as a key structural property shared by both policy and language, and propose EquAct, a novel SE(3)-equivariant multi-task transformer. EquAct is theoretically guaranteed to be SE(3) equivariant and consists of two key components: (1) an efficient SE(3)-equivariant point cloud-based U-net with spherical Fourier features for policy reasoning, and (2) SE(3)-invariant Feature-wise Linear Modulation (iFiLM) layers for language conditioning. To evaluate its spatial generalization ability, we benchmark EquAct on 18 RLBench simulation tasks with both SE(3) and SE(2) scene perturbations, and on 4 physical tasks. EquAct achieves state-of-the-art performance across these simulation and physical tasks.
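To make the idea of invariant language conditioning concrete, the following is a minimal numerical sketch and not the EquAct implementation. It assumes per-point features that transform as 3D vectors (a simplification of the spherical Fourier features named above) and a hypothetical layer, `invariant_film`, that scales each channel by a gain computed only from the language embedding. Because the gain does not depend on the scene pose, the modulation commutes with rotation. The additive shift of standard FiLM is omitted here, since a fixed shift added to vector-valued channels would not rotate with the scene. All names, shapes, and the weight matrix `w_gamma` are assumptions for illustration.

```python
import numpy as np

def random_rotation(rng):
    """Sample a 3x3 rotation matrix via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # remove the sign ambiguity of QR
    if np.linalg.det(q) < 0:         # force a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return q

def invariant_film(vec_feats, lang_emb, w_gamma):
    """Scale vector-valued channels by language-derived, pose-independent gains.

    vec_feats: (N, C, 3) array, C vector channels for each of N points
    lang_emb:  (D,) language embedding (hypothetical)
    w_gamma:   (D, C) weights mapping language to per-channel gains (hypothetical)
    """
    gamma = lang_emb @ w_gamma                   # (C,) gains, unaffected by rotation
    return vec_feats * gamma[None, :, None]

def equivariance_gap(vec_feats, lang_emb, w_gamma, rot):
    """Largest difference between 'rotate then modulate' and 'modulate then rotate'."""
    rotate_first = invariant_film(vec_feats @ rot.T, lang_emb, w_gamma)
    rotate_last = invariant_film(vec_feats, lang_emb, w_gamma) @ rot.T
    return float(np.max(np.abs(rotate_first - rotate_last)))

rng = np.random.default_rng(0)
feats = rng.normal(size=(128, 8, 3))   # 128 points, 8 vector channels
lang = rng.normal(size=16)
w = rng.normal(size=(16, 8))
gap = equivariance_gap(feats, lang, w, random_rotation(rng))
print(f"equivariance gap: {gap:.2e}")  # expected to be at floating-point noise level
```

The check covers only the rotational part of SE(3); the features in this sketch are assumed to be independent of the scene's translation, so translation is not exercised.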
