EquiContact: A Hierarchical SE(3) Vision-to-Force Equivariant Policy for Spatially Generalizable Contact-rich Tasks

Joohwan Seo, Arvind Kruthiventy, Soomi Lee, Megan Teng, Xiang Zhang, Seoyeon Choi, Jongeun Choi, Roberto Horowitz

TL;DR

EquiContact presents a hierarchical SE(3) vision-to-force policy for spatially generalizable contact-rich manipulation, combining a high-level Diffusion Equivariant Descriptor Field (Diff-EDF) with a low-level Geometric Compliant ACT (G-CompACT) and a geometric admittance controller (GAC). The framework relies on three design principles—compliance, localized policies, and induced equivariance—to achieve SE(3) equivariance from perception to force control, validated on peg-in-hole, screwing, and surface wiping with strong generalization to unseen spatial configurations. Key contributions include the provable SE(3) equivariance of the EquiContact pipeline, the left-invariant design of G-CompACT, and experimental evidence that the approach outperforms baselines in both in-distribution and out-of-distribution scenarios, while maintaining low interaction forces. The work offers a structured, interpretable blueprint for building spatially generalizable manipulation policies, complementing data-driven end-to-end methods and providing a practical path toward real-world contact-rich robot assistance.

Abstract

This paper presents a framework for learning vision-based robotic policies for contact-rich manipulation tasks that generalize spatially across task configurations. We focus on achieving robust spatial generalization of the policy for the peg-in-hole (PiH) task trained from a small number of demonstrations. We propose EquiContact, a hierarchical policy composed of a high-level vision planner (Diffusion Equivariant Descriptor Field, Diff-EDF) and a novel low-level compliant visuomotor policy (Geometric Compliant ACT, G-CompACT). G-CompACT operates using only localized observations (geometrically consistent error vectors (GCEV), force-torque readings, and wrist-mounted RGB images) and produces actions defined in the end-effector frame. Through these design choices, we show that the entire EquiContact pipeline is SE(3)-equivariant, from perception to force control. We also outline three key components for spatially generalizable contact-rich policies: compliance, localized policies, and induced equivariance. Real-world experiments on PiH, screwing, and surface wiping tasks demonstrate a near-perfect success rate and robust generalization to unseen spatial configurations, validating the proposed framework and principles. The experimental videos can be found on the project website: https://sites.google.com/berkeley.edu/equicontact
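
As a hedged illustration of why localized observations help, the sketch below checks numerically that an error vector built from the end-effector pose $g$ and the reference pose $g_{ref}$ does not change when both poses are moved by the same left transformation $g_l$. The error-vector formula, the helper names (`rot`, `pose`, `local_error`), and the sample poses are assumptions made for illustration. This section does not give the paper's exact GCEV definition.

```python
import numpy as np

def rot(axis, angle):
    """Rotation matrix from an axis and an angle (Rodrigues' formula)."""
    axis = np.asarray(axis, dtype=float)
    x, y, z = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def pose(R, p):
    """Assemble a 4x4 homogeneous transform from rotation R and position p."""
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = p
    return g

def local_error(g, g_ref):
    """Assumed 6-D error between g and g_ref, expressed in the end-effector frame."""
    R, p = g[:3, :3], g[:3, 3]
    R_ref, p_ref = g_ref[:3, :3], g_ref[:3, 3]
    e_p = R.T @ (p - p_ref)                    # position error in the body frame
    S = R_ref.T @ R - R.T @ R_ref              # skew-symmetric rotation error
    e_R = np.array([S[2, 1], S[0, 2], S[1, 0]])  # vee map of S
    return np.concatenate([e_p, e_R])

# Hypothetical poses: current end-effector, reference frame, and a left action.
g = pose(rot([0, 0, 1], 0.3), [0.4, 0.0, 0.2])
g_ref = pose(rot([1, 0, 0], 0.5), [0.5, 0.1, 0.1])
g_l = pose(rot([1, 1, 0], 1.0), [-0.2, 0.3, 0.05])

e_before = local_error(g, g_ref)
e_after = local_error(g_l @ g, g_l @ g_ref)    # move the whole scene by g_l
print(np.allclose(e_before, e_after))
```

Because the error depends only on the relative pose between the end-effector and the reference frame, a policy fed such inputs sees the same observation wherever the task is placed in the workspace.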

Paper Structure

This paper contains 33 sections, 3 theorems, 28 equations, 6 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

Suppose that Assumption \ref{assump:2} holds. Then,

Figures (6)

  • Figure 1: We propose EquiContact, a hierarchical, provably $SE(3)$ vision-to-force equivariant policy for spatially generalizable contact-rich tasks. EquiContact consists of G-CompACT (Section \ref{Sec:G-CompACT}) and Diffusion-EDF (Section \ref{Sec:Diff-EDF}). G-CompACT acts as a localized policy relative to the reference frame provided by Diffusion-EDF, which lets the framework generalize to unseen task transformations at evaluation time. The overall EquiContact algorithm is summarized in Algorithm \ref{alg:EquiContact}.
  • Figure 2: (Left) Overview of the workspace for the peg-in-hole assembly task. Two external cameras with calibrated extrinsics and two wrist cameras are installed. (Right-Top) Peg and hole assembly with $1 \mathrm{mm}$ of clearance. (Right-Bottom) Hole part with flat and tilted ($30^\circ$) platforms.
  • Figure 3: Effects of the left group action $g_l$ on the end-effector pose $g$ and the reference frame $g_{ref}$, and on the wrist cameras $I_{w,1}$ and $I_{w,2}$. When the left group action is applied to the end-effector, the wrist cameras start to see backgrounds other than the optical table. $\{s\}$ denotes a spatial frame, i.e., the robot base frame or the world frame.
  • Figure 4: Force profiles of CompACT and of ACT with GAC (fixed gains) during insertion tasks. CompACT, with force-torque sensor inputs and gain outputs, shows lower interaction forces in all directions.
  • Figure 5: (Left) Screwing task. (Right) Surface wiping (erasing) task. The same platform structure as in the PiH task is used.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Proposition 1: Left-invariance of G-CompACT
  • Corollary 1: $SE(3)$ Equivariance of G-CompACT
  • proof
  • Proposition 2
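
The step from left-invariance (Proposition 1) to $SE(3)$ equivariance (Corollary 1) can be sketched numerically under simple assumptions. If a policy depends only on the relative pose $g^{-1} g_{ref}$ and outputs an action in the end-effector frame, then the resulting world-frame target moves with the scene. The function `toy_local_policy` below is a hypothetical stand-in and not G-CompACT. It only mimics the input/output structure described in the abstract.

```python
import numpy as np

def rot(axis, angle):
    """Rotation matrix from an axis and an angle (Rodrigues' formula)."""
    axis = np.asarray(axis, dtype=float)
    x, y, z = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def pose(R, p):
    """Assemble a 4x4 homogeneous transform from rotation R and position p."""
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = p
    return g

def toy_local_policy(g, g_ref):
    """Hypothetical policy: uses only the relative pose, acts in the body frame."""
    rel = np.linalg.inv(g) @ g_ref   # left-invariant input
    action = np.eye(4)
    action[:3, 3] = 0.5 * rel[:3, 3]  # translate halfway toward the reference
    return action

def world_target(g, g_ref):
    """Map the body-frame action to a target pose in the spatial frame."""
    return g @ toy_local_policy(g, g_ref)

# Hypothetical poses and an arbitrary left transformation of the whole scene.
g = pose(rot([0, 1, 0], 0.2), [0.3, -0.1, 0.25])
g_ref = pose(rot([0, 0, 1], 0.7), [0.45, 0.05, 0.1])
g_l = pose(rot([1, 0, 1], 0.9), [0.1, -0.4, 0.2])

target = world_target(g, g_ref)
target_moved = world_target(g_l @ g, g_l @ g_ref)
print(np.allclose(target_moved, g_l @ target))
```

The policy output is identical in both scenes, and all of the dependence on $g_l$ enters through the multiplication by the current end-effector pose. This is one way to read the term "induced equivariance".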