RiEMann: Near Real-Time SE(3)-Equivariant Robot Manipulation without Point Cloud Segmentation

Chongkai Gao, Zhengrong Xue, Shuying Deng, Tianhai Liang, Siqi Yang, Lin Shao, Huazhe Xu

TL;DR

RiEMann is an end-to-end near Real-time SE(3)-Equivariant Robot Manipulation imitation learning framework that takes a scene point cloud as input and directly predicts the target poses of objects for manipulation, without any object segmentation.

Abstract

We present RiEMann, an end-to-end near Real-time SE(3)-Equivariant Robot Manipulation imitation learning framework from scene point cloud input. Compared to previous methods that rely on descriptor field matching, RiEMann directly predicts the target poses of objects for manipulation without any object segmentation. RiEMann learns a manipulation task from scratch with 5 to 10 demonstrations, generalizes to unseen SE(3) transformations and instances of target objects, resists visual interference of distracting objects, and follows the near real-time pose change of the target object. The scalable action space of RiEMann facilitates the addition of custom equivariant actions such as the direction of turning the faucet, which makes articulated object manipulation possible for RiEMann. In simulation and real-world 6-DOF robot manipulation experiments, we test RiEMann on 5 categories of manipulation tasks with a total of 25 variants and show that RiEMann outperforms baselines in both task success rates and SE(3) geodesic distance errors on predicted poses (reduced by 68.6%), and achieves a 5.4 frames per second (FPS) network inference speed. Code and video results are available at https://riemann-web.github.io/.
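
The abstract reports pose accuracy as an SE(3) geodesic distance error without defining it here. As a rough illustration only, the sketch below assumes a common convention: the rotation error is the geodesic angle between two rotation matrices, and the translation error is the Euclidean distance, reported separately. The function names and the split into two numbers are assumptions for this example, not the authors' definition.

```python
import math

def rotation_geodesic(R1, R2):
    """Geodesic angle (radians) between two 3x3 rotation matrices.

    Uses trace(R1^T R2) = sum_ij R1[i][j] * R2[i][j].
    """
    trace = sum(R1[i][j] * R2[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)

def translation_error(t1, t2):
    """Euclidean distance between two 3D translations."""
    return math.dist(t1, t2)

def rot_z(a):
    """Rotation by angle a (radians) around the z-axis."""
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical predicted and ground-truth poses, for illustration only.
identity = rot_z(0.0)
quarter_turn = rot_z(math.pi / 2)
rot_err = rotation_geodesic(identity, quarter_turn)
trans_err = translation_error([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
print(rot_err, trans_err)
```

Under this convention, a predicted pose that is off by a quarter turn around one axis has a rotation error of about π/2 radians, regardless of which axis it is.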

Paper Structure (38 sections, 10 equations, 9 figures, 7 tables, 2 algorithms)

Figures (9)

  • Figure 1: Overview of RiEMann. (a) Given 5 to 10 demonstrations of the task Mug on Rack with restricted object poses (the mug remains standing and only rotates in 90 degrees around the z-axis) and (b) the full scene point cloud as input without segmentation, (c) RiEMann generalizes to local SE(3) transformations of target objects and to new instances of target objects, is robust to distracting objects, and (d) can follow target objects in near real time.
  • Figure 2: Illustration of different 3D rotation representations under type-$l$ parameterizations. (a) The initial point cloud $x_1$ is transformed to $x_2$ by a 3D rotation $R$. (b) Three type-$1$ vectors represent a rotation matrix, which transforms with the same rotation $R$ as the input. (c), (d) Three type-$0$ vectors represent Euler angles, and one type-$1$ vector plus one type-$0$ vector represent the axis-angle form. Neither transforms with the same rotation $R$ as the input, so they are not SE(3)-equivariant parameterizations.
  • Figure 3: Pipeline of RiEMann. An SE(3)-invariant backbone $\phi$ first outputs a type-$0$ saliency map, which is used to select a small point cloud region $\mathbf{B}_{ROI}$. An SE(3)-equivariant policy network consisting of $\psi_1$ and $\psi_2$ then predicts the action vector fields on the points of $\mathbf{B}_{ROI}$. Finally, we perform softmax, region mean pooling, and IMGS to get the target action $\mathbf{T}$.
  • Figure 4: Simulation and real-world environments. Top: the training environment settings of the simulation and real-world tasks. Bottom: the ALL test case of each task, where the target object is a new instance in a new pose and distracting objects are added to the environment.
  • Figure 5: Test pose predictions and feature visualization of the real-world task mug-on-rack.
  • ...and 4 more figures
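
The claim in the Figure 2 caption can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: the helper names, the Z-Y-X Euler convention, and the example angles are all hypothetical. It rotates a target pose by a scene rotation $R$ and compares two parameterizations. The columns of the rotation matrix, treated as three type-$1$ vectors, each rotate with $R$. The Euler angles, treated as three type-$0$ (invariant) scalars, would have to stay unchanged, but the true angles of the rotated pose differ.

```python
import math

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def column(A, j):
    return [A[i][j] for i in range(3)]

def euler_zyx(M):
    """Return (yaw, pitch, roll) such that M = Rz(yaw) Ry(pitch) Rx(roll)."""
    yaw = math.atan2(M[1][0], M[0][0])
    pitch = -math.asin(max(-1.0, min(1.0, M[2][0])))
    roll = math.atan2(M[2][1], M[2][2])
    return [yaw, pitch, roll]

def close(u, v, tol=1e-9):
    return all(abs(a - b) <= tol for a, b in zip(u, v))

T = rot_z(0.3)            # hypothetical target orientation
R = rot_x(0.5)            # hypothetical rotation applied to the whole scene
T_rotated = matmul(R, T)  # target orientation after the scene is rotated

# Type-1 check: each column of the new pose is R applied to the old column.
columns_equivariant = all(
    close(column(T_rotated, j), matvec(R, column(T, j))) for j in range(3)
)
# Type-0 check: invariant scalars would require the Euler angles to be unchanged.
euler_unchanged = close(euler_zyx(T_rotated), euler_zyx(T))
print(columns_equivariant, euler_unchanged)
```

In this example the column check holds and the Euler-angle check fails, which matches the caption's point that only the three-vector rotation-matrix form transforms with the input.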