EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning

Jingyun Yang, Zi-ang Cao, Congyue Deng, Rika Antonova, Shuran Song, Jeannette Bohg

TL;DR

EquiBot introduces a SIM(3)-equivariant diffusion-based visuomotor policy for robot manipulation, enabling strong generalization to unseen object poses and scales from limited demonstrations. By embedding SIM(3)-equivariant encoders and an SO(3)-equivariant conditional U-net within a diffusion framework, it yields action distributions that respect 3D transformations and support multi-modal behaviors. Across six simulated and six real tasks, EquiBot demonstrates superior data efficiency and robustness to out-of-distribution scenarios compared with vanilla diffusion policies and prior equivariant approaches. The work advances practical, sample-efficient robotic learning, with implications for real-world deployment using minimal human demonstrations.

Abstract

Building effective imitation learning methods that enable robots to learn from limited data and still generalize across diverse real-world environments is a long-standing problem in robot learning. We propose EquiBot, a robust, data-efficient, and generalizable approach for robot manipulation task learning. Our approach combines SIM(3)-equivariant neural network architectures with diffusion models. This ensures that our learned policies are invariant to changes in scale, rotation, and translation, enhancing their applicability to unseen environments while retaining the benefits of diffusion-based policy learning such as multi-modality and robustness. We show on a suite of 6 simulation tasks that our proposed method reduces the data requirements and improves generalization to novel scenarios. In the real world, with 10 variations of 6 mobile manipulation tasks, we show that our method can easily generalize to novel objects and scenes after learning from just 5 minutes of human demonstrations in each task.

Paper Structure

This paper contains 25 sections, 2 theorems, 8 equations, 13 figures, and 2 tables.

Key Result

Proposition 1

Let $p(x^K \mid c)$ be an SO(3)-equivariant density function conditioned on $c$, i.e. $\forall \mathbf{R} \in \mathrm{SO}(3),\ p(x^K \mid c) = p(\mathbf{R} x^K \mid \mathbf{R} c)$. If the Markov transitions $p(x^{k-1} \mid x^{k}, c)$ are SO(3)-equivariant for all $k$, i.e. $p(x^{k-1} \mid x^{k}, c) = p(\mathbf{R} x^{k-1} \mid \mathbf{R} x^{k}, \mathbf{R} c)$, ...
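
The statement above is cut off before its conclusion. Assuming it concludes, as is usual for results of this form, that the marginal $p(x^0 \mid c)$ produced by the reverse chain is also SO(3)-equivariant, the argument can be sketched as follows. The conclusion is an assumption here, not text recovered from the source; the substitution variable $y^k$ is introduced only for this sketch.

$$
\begin{aligned}
p(\mathbf{R} x^0 \mid \mathbf{R} c)
&= \int p(y^K \mid \mathbf{R} c) \prod_{k=1}^{K} p(y^{k-1} \mid y^{k}, \mathbf{R} c)\, dy^{1:K}, \qquad y^0 = \mathbf{R} x^0 \\
&= \int p(\mathbf{R} x^K \mid \mathbf{R} c) \prod_{k=1}^{K} p(\mathbf{R} x^{k-1} \mid \mathbf{R} x^{k}, \mathbf{R} c)\, dx^{1:K} \\
&= \int p(x^K \mid c) \prod_{k=1}^{K} p(x^{k-1} \mid x^{k}, c)\, dx^{1:K} \\
&= p(x^0 \mid c).
\end{aligned}
$$

The second line substitutes $y^k = \mathbf{R} x^k$, whose Jacobian satisfies $|\det \mathbf{R}| = 1$ for a rotation. The third line applies the two equivariance assumptions term by term, and the last line is the definition of the marginal.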

Figures (13)

  • Figure 1: We propose a method for learning generalizable and sample-efficient visuomotor policies that can be applied to everyday manipulation tasks.
  • Figure 2: Method overview. Given input scene point cloud & robot pose, our method performs a series of diffusion steps to obtain denoised actions with SIM(3)-equivariance, i.e. when the inputs translate, rotate, and scale, the outputs are guaranteed to translate, rotate, and scale accordingly.
  • Figure 3: Visualizations of simulation environments. The three mobile manipulation tasks feature varied rigid, deformable, and articulated objects. The Push T task features multi-modal demonstration data that challenge the learning algorithms. The Can and Square tasks from the Robomimic benchmark require precise position and orientation movements to successfully complete the tasks.
  • Figure 4: Results of out-of-distribution generalization experiments. We show that our method achieves more robust out-of-distribution generalization performance than methods that do not use diffusion processes to model policies and ones that do not utilize equivariance. Error bars show the mean and standard deviation over 5 checkpoints and 3 seeds.
  • Figure 5: Results of data efficiency experiments. Our method achieves better data efficiency than the Diffusion Policy when evaluated in distribution on two benchmark tasks.
  • ...and 8 more figures
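
The SIM(3)-equivariance property described in the Figure 2 caption can be checked numerically on a toy map. The sketch below is not the EquiBot architecture: `toy_policy`, its centroid-and-scale canonicalization, and the `tanh` norm gating are hypothetical stand-ins chosen only because they are equivariant by construction. It compares "transform the input, then run the policy" against "run the policy, then transform the output".

```python
import numpy as np


def random_rotation(rng):
    """Sample a proper rotation matrix (orthogonal, det = +1)."""
    # QR of a Gaussian matrix yields an orthogonal Q; flip one column
    # if needed so the determinant is +1 rather than -1.
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q


def toy_policy(points, weights):
    """Hypothetical policy: (N, 3) point cloud -> (T, 3) action positions."""
    # Remove translation and scale so the core map only sees rotation.
    center = points.mean(axis=0)
    centered = points - center
    scale = np.linalg.norm(centered, axis=1).mean()
    canonical = centered / scale
    # Mix across points (not across xyz), which commutes with rotation.
    mixed = weights @ canonical
    # Gate each output vector by a function of its norm; norms are
    # rotation-invariant, so this nonlinearity stays SO(3)-equivariant.
    norms = np.linalg.norm(mixed, axis=1, keepdims=True)
    gated = mixed * np.tanh(norms)
    # Restore scale and translation.
    return gated * scale + center


def equivariance_error(seed=0):
    """Max abs difference between f(T(x)) and T(f(x)) for a random T in SIM(3)."""
    rng = np.random.default_rng(seed)
    points = rng.normal(size=(64, 3))
    weights = rng.normal(size=(4, 64))
    rot = random_rotation(rng)
    s = 2.5
    t = np.array([0.3, -1.0, 2.0])
    transform_then_policy = toy_policy(s * points @ rot.T + t, weights)
    policy_then_transform = s * toy_policy(points, weights) @ rot.T + t
    return np.abs(transform_then_policy - policy_then_transform).max()


if __name__ == "__main__":
    print(f"max equivariance error: {equivariance_error():.2e}")
```

The error should sit at floating-point round-off. Dropping the division by `scale` in `toy_policy` breaks scale equivariance, because the `tanh` gate is not homogeneous in its input, which illustrates why scale has to be factored out before an SO(3)-equivariant nonlinearity is applied.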

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 1
  • proof