EquivAct: SIM(3)-Equivariant Visuomotor Policies beyond Rigid Object Manipulation

Jingyun Yang, Congyue Deng, Jimmy Wu, Rika Antonova, Leonidas Guibas, Jeannette Bohg

TL;DR

EquivAct tackles zero-shot generalization of visuomotor policies to unseen object appearances, scales, and poses, including for deformable and articulated objects. It uses SIM(3)-equivariant networks to learn a 3D visual representation and a closed-loop visuomotor policy in two phases. First, a SIM(3)-equivariant encoder is pre-trained with a contrastive objective on simulation data. Second, a SIM(3)-equivariant policy that maps partial point clouds and end-effector poses to actions is trained from a small set of demonstrations. In both simulation and real-robot experiments, EquivAct outperforms augmentation-based and non-equivariant baselines and transfers zero-shot to objects of substantially different size, orientation, and appearance. The results indicate that combining SIM(3) equivariance with a pre-trained 3D representation gives data-efficient, generalizable visuomotor control on the three deformable and articulated manipulation tasks evaluated.

Abstract

If a robot masters folding a kitchen towel, we would expect it to master folding a large beach towel. However, existing policy learning methods that rely on data augmentation still do not guarantee such generalization. Our insight is to add equivariance to both the visual object representation and the policy architecture. We propose EquivAct, which uses SIM(3)-equivariant network structures that guarantee generalization across all possible object translations, 3D rotations, and scales by construction. EquivAct is trained in two phases. We first pre-train a SIM(3)-equivariant visual representation on simulated scene point clouds. Then, we learn a SIM(3)-equivariant visuomotor policy using a small number of source task demonstrations. We show that the learned policy directly transfers to objects that substantially differ from demonstrations in scale, position, and orientation. We evaluate our method in three manipulation tasks involving deformable and articulated objects, going beyond typical rigid object manipulation tasks considered in prior work. We conduct experiments both in simulation and in reality. For real robot experiments, our method uses 20 human demonstrations of a tabletop task and transfers zero-shot to a mobile manipulation task in a much larger setup. Experiments confirm that our contrastive pre-training procedure and equivariant architecture offer significant improvements over prior work. Project website: https://equivact.github.io
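
The guarantee "by construction" mentioned in the abstract can be written out as a short definition. This is the standard textbook formulation of SIM(3) equivariance, not an equation taken from the paper; the symbols $f$, $X$, $s$, $R$, $t$, and $v$ are notation assumed here for illustration.

```latex
% A similarity transform T = (s, R, t) acts on a point x in R^3 by
% scaling, rotating, and then translating it:
T \cdot x = s R x + t, \qquad s > 0,\; R \in \mathrm{SO}(3),\; t \in \mathbb{R}^3 .

% A map f on point clouds X = \{x_i\} is SIM(3)-equivariant if
% transforming the input transforms the output in the same way:
f(T \cdot X) = T \cdot f(X) \qquad \text{for all } T \in \mathrm{SIM}(3).

% Assumption: for an output that is a velocity v rather than a position,
% the translation drops out and only scale and rotation act on it:
v \;\mapsto\; s R v .
```

Under this reading, a policy trained on a kitchen towel would produce correspondingly scaled and rotated motions on a larger, differently placed towel without seeing such examples during training.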

Paper Structure

This paper contains 14 sections, 5 equations, 9 figures, 1 table.

Figures (9)

  • Figure 2: Representation learning pipeline. This phase takes paired partial point clouds as inputs, processes them through an equivariant encoder-decoder architecture, then employs a contrastive loss based on invariant point features, yielding equivariant global and local features as output.
  • Figure 3: Visualizations of per-point features. The encoder features are equivariant vector-valued features on the partial point cloud observations and the visualizations are done on their invariant components (channel-wise 2-norms). The decoder features are invariant scalar-valued features on the complete objects. The RGB values are computed via PCA within each task. All point clouds are aligned to the canonical pose for visualization. Top two rows: Objects of different shapes viewed from different camera angles but at the same poses. Both encoder and decoder features show strong correspondences within each state due to the contrastive learning. Bottom row: Objects in a different state (features become different from the two top rows).
  • Figure 4: Policy learning architecture. We first pass a point cloud captured during policy execution through the frozen encoder from the representation learning phase to get local and global equivariant features. These are passed to two VN heads to get target end-effector velocities and open/close actions.
  • Figure 5: Simulation environments. Cloth Folding (left): two grippers fold a piece of cloth by grasping two corners of the cloth. Object Covering (middle): two grippers pick up a cloth at two corners and drag it to fully cover another object. This task tests the ability to handle scenarios with several objects. Box Closing (right): two manipulators close a box with three flaps by first closing the side flaps and then closing the larger front/back flap. This task tests manipulation with articulated objects.
  • Figure 6: Results for simulation experiments. We evaluate 3 manipulation tasks involving deformable and articulated objects. The comparisons with baselines show that our method outperforms prior methods that rely on augmentations to achieve generalization or utilize non-equivariant representations.
  • ...and 4 more figures
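
The captions for Figures 3 and 4 refer to vector-valued equivariant features, their invariant components (channel-wise 2-norms), and VN heads. The sketch below illustrates those ideas with a minimal Vector-Neuron-style linear layer in NumPy. It is not the authors' implementation: the function names (`random_rotation`, `vn_linear`, `invariant_norms`), the feature shapes, and the random weights are all assumptions made for illustration. It only covers rotation and scale; translation is assumed to have been removed beforehand, for example by centering the point cloud.

```python
import numpy as np


def random_rotation(rng):
    """Sample a 3x3 rotation matrix (orthogonal, determinant +1)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # make the QR factorization unique
    if np.linalg.det(q) < 0:      # flip one axis to turn a reflection into a rotation
        q[:, 0] = -q[:, 0]
    return q


def vn_linear(weights, feats):
    """Vector-Neuron-style linear layer (hypothetical helper).

    weights: (C_out, C_in) matrix that mixes channels only.
    feats:   (C_in, 3) array, one 3D vector per channel.
    Because the weights never touch the 3D axis, rotating or scaling
    the input vectors rotates or scales the output vectors identically.
    """
    return weights @ feats


def invariant_norms(feats):
    """Channel-wise 2-norms: unchanged by rotation, scaled by s under scaling."""
    return np.linalg.norm(feats, axis=-1)


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))   # illustrative layer weights
V = rng.normal(size=(4, 3))   # illustrative vector-valued features
R = random_rotation(rng)
s = 2.5                       # illustrative scale factor

out = vn_linear(W, V)
out_transformed = vn_linear(W, s * V @ R.T)   # rotate and scale the input first

# Equivariance: transforming the input equals transforming the output.
equivariant_ok = np.allclose(out_transformed, s * out @ R.T)

# Norms ignore the rotation and pick up only the scale factor.
norms_ok = np.allclose(invariant_norms(out_transformed), s * invariant_norms(out))

print("equivariant:", equivariant_ok, "| norms scale with s:", norms_ok)
```

The norms here are rotation-invariant but still scale with `s`; making them fully scale-invariant would need an extra normalization step, which this sketch leaves out.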