EquAct: An SE(3)-Equivariant Multi-Task Transformer for Open-Loop Robotic Manipulation

Xupeng Zhu, Yu Qi, Yizhe Zhu, Robin Walters, Robert Platt

TL;DR

EquAct tackles the challenge of geometric generalization in language-conditioned 3D manipulation by enforcing continuous SE(3) equivariance in the policy while keeping language conditioning invariant to scene transformations. It integrates an SE(3)-equivariant Point Transformer U-net (EPTU) with spherical Fourier features and an SE(3)-invariant iFiLM layer to fuse language, plus equivariant field networks to evaluate translational, rotational, and gripper actions in a unified, single-forward model. The approach is theoretically grounded with proofs of equivariance and invariance and empirically validated on 18 RLBench tasks with SE(2) and SE(3) initializations plus 4 real-world tasks, achieving state-of-the-art performance and strong 3D generalization. The work demonstrates that preserving geometric structure directly in the network architecture, along with language grounding that respects SE(3) symmetries, yields faster, more reliable generalization to unseen 3D configurations and perturbations, with practical implications for scalable, language-conditioned robotic manipulation.

Abstract

Transformer architectures can effectively learn language-conditioned, multi-task 3D open-loop manipulation policies from demonstrations by jointly processing natural language instructions and 3D observations. However, although both the robot policy and language instructions inherently encode rich 3D geometric structures, standard transformers lack built-in guarantees of geometric consistency, often resulting in unpredictable behavior under SE(3) transformations of the scene. In this paper, we leverage SE(3) equivariance as a key structural property shared by both policy and language, and propose EquAct, a novel SE(3)-equivariant multi-task transformer. EquAct is theoretically guaranteed to be SE(3) equivariant and consists of two key components: (1) an efficient SE(3)-equivariant point cloud-based U-net with spherical Fourier features for policy reasoning, and (2) SE(3)-invariant Feature-wise Linear Modulation (iFiLM) layers for language conditioning. To evaluate its spatial generalization ability, we benchmark EquAct on 18 RLBench simulation tasks with both SE(3) and SE(2) scene perturbations, and on 4 physical tasks. EquAct achieves state-of-the-art performance across these simulation and physical tasks.

Paper Structure

This paper contains 40 sections, 4 theorems, 15 equations, 6 figures, 5 tables.

Key Result

Proposition 4.1

EquAct is $\mathrm{SE}(3)$-equivariant in its observation-to-action mapping and $\mathrm{SE}(3)$-invariant with respect to the natural language instruction, as described in Equation equ:equact.
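
Read together with the Figure 2 caption, the proposition can be written as a single identity. What follows is a sketch in notation assumed here for illustration: $\pi$ denotes the keyframe policy (a symbol not taken from the paper), $o$ the observation, $n$ the instruction, and $g$ an arbitrary element of $\mathrm{SE}(3)$. The paper's own equation may be stated differently.

```latex
\pi(g \cdot o,\; n) \;=\; g \cdot \pi(o,\; n)
\qquad \text{for all } g \in \mathrm{SE}(3).
```

The instruction $n$ appears unchanged on both sides, which is the invariance half of the statement: transforming the scene moves the predicted action with it, while the language input itself is not transformed.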

Figures (6)

  • Figure 1: Overview of EquAct. EquAct first encodes the observation $o = \{s, e\}$ into latent spherical features $h$ using an $\mathrm{SE}(3)$-equivariant U-Net, $enc_o$, while conditioning on the natural language instruction $n$ through invariant iFiLM layers. Based on the encoded features $h$, EquAct then samples and refines translational query actions and gripper open actions using an equivariant field network, resulting in action value functions $Q_t$ and $Q_{\mathrm{open}}$. Finally, a rotational field network aggregates spherical features from $h$ centered at the predicted translation $a_t^*$ to obtain a latent feature $\phi$, which is subsequently convolved with a learned filter $\psi$ to produce the rotational action value function $Q_r$.
  • Figure 2: The equivariance and invariance of the multi-task keyframe policy. Under the equivariance assumption, when the observation is transformed to $g \cdot o$, the predicted action transforms accordingly to $g \cdot a$. Under the invariance assumption, given a fixed natural language instruction $n$, the action transformation depends solely on the transformation applied to the observation.
  • Figure 3: $\mathrm{SE}(3)$-Equivariant Point Transformer U-net (EPTU).
  • Figure 4: Simulation and physical experiments. First row: 18 standard RLBench tasks (shridhar2023perceiver; james2020rlbench). Second row: 18 RLBench tasks with $\mathrm{SE}(3)$ randomization. Third row: 4 physical experiments. A language instruction specifies each variant of the task.
  • Figure 5: The 4 physical tasks.
  • ...and 1 more figure
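
The equivariance and invariance properties described in the Figure 2 caption can be checked numerically on a toy example. The sketch below is not EquAct: `toy_policy`, its weight vector `w`, and the random "language embedding" are hypothetical stand-ins chosen only so that the property holds by construction. Language enters through a scalar that does not depend on the scene, and the action is a convex combination of the input points.

```python
import numpy as np

def random_se3(rng):
    """Sample a random rotation R (det = +1) and a random translation t."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))       # fix column signs of the orthogonal factor
    if np.linalg.det(q) < 0:          # flip one column to get a proper rotation
        q[:, 0] = -q[:, 0]
    return q, rng.normal(size=3)

def toy_policy(points, lang_emb, w):
    """Hypothetical stand-in policy returning a 3D translation action.

    points:   (N, 3) point cloud
    lang_emb: (D,)   instruction embedding (assumed given)
    w:        (D,)   hypothetical weights
    """
    centroid = points.mean(axis=0)
    # Distances to the centroid do not change under rotation or translation.
    dist = np.linalg.norm(points - centroid, axis=1)
    # Language contributes a scalar that is independent of the scene pose.
    gamma = np.tanh(lang_emb @ w)
    logits = -gamma * dist
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Weights sum to 1, so the output moves rigidly with the points.
    return weights @ points

rng = np.random.default_rng(0)
points = rng.normal(size=(64, 3))
lang_emb = rng.normal(size=8)
w = rng.normal(size=8)
R, t = random_se3(rng)

a = toy_policy(points, lang_emb, w)                # action for the original scene
a_g = toy_policy(points @ R.T + t, lang_emb, w)    # action for the transformed scene
equivariant = np.allclose(a_g, R @ a + t)
print(equivariant)
```

The same kind of check (transform the input, run the model, compare against the transformed output) is a common way to test an equivariant network empirically, complementing a formal proof.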

Theorems & Definitions (8)

  • Proposition 4.1
  • Proposition 4.2
  • Proposition 4.3
  • Proposition 4.4
  • proof
  • proof
  • proof
  • proof