RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective

Chenxi Wang, Hongjie Fang, Hao-Shu Fang, Cewu Lu

TL;DR

RISE addresses real-world robot imitation by learning continuous actions from a single-view, noisy point cloud. It integrates a sparse 3D encoder, sparse positional encoding, a transformer, and a diffusion-based action decoder to produce robust, continuous action trajectories. Evaluations across six real-world tasks with 50 demonstrations per task show that RISE outperforms representative 2D and 3D baselines and generalizes under camera, lighting, and workspace changes. The approach underscores the practical value of 3D perception for end-to-end manipulation and provides a strong, scalable baseline for future research.

Abstract

Precise robot manipulations require rich spatial information in imitation learning. Image-based policies model object positions from fixed cameras, which are sensitive to camera view changes. Policies utilizing 3D point clouds usually predict keyframes rather than continuous actions, posing difficulty in dynamic and contact-rich scenarios. To utilize 3D perception efficiently, we present RISE, an end-to-end baseline for real-world imitation learning, which predicts continuous actions directly from single-view point clouds. It compresses the point cloud to tokens with a sparse 3D encoder. After adding sparse positional encoding, the tokens are featurized using a transformer. Finally, the features are decoded into robot actions by a diffusion head. Trained with 50 demonstrations for each real-world task, RISE surpasses currently representative 2D and 3D policies by a large margin, showcasing significant advantages in both accuracy and efficiency. Experiments also demonstrate that RISE is more general and robust to environmental change compared with previous baselines. Project website: rise-policy.github.io.
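The first two stages named in the abstract (compressing a point cloud into sparse tokens, then attaching a positional encoding tied to each token's 3D location) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract says only that a sparse 3D encoder and a sparse positional encoding are used. The function names `voxelize` and `sparse_positional_encoding`, the 5 mm `voxel_size`, the sinusoidal form, and the use of voxel centroids as a stand-in for learned sparse-convolution features are all assumptions made for illustration.

```python
import numpy as np

def voxelize(points, voxel_size=0.005):
    """Quantize an (N, 3) point cloud into its occupied voxels.

    Returns the integer coordinates of the M occupied voxels, shape (M, 3),
    and the centroid of the points falling in each voxel, shape (M, 3).
    The centroid is only a placeholder for a learned per-voxel feature.
    """
    coords = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flatten for consistency across NumPy versions
    sums = np.zeros((len(uniq), 3))
    np.add.at(sums, inverse, points)  # accumulate points per voxel
    counts = np.bincount(inverse, minlength=len(uniq))
    return uniq, sums / counts[:, None]

def sparse_positional_encoding(coords, dim=48):
    """Sinusoidal encoding of integer voxel coordinates (assumed form).

    Each of the 3 axes gets dim // 6 sine and dim // 6 cosine channels,
    so only occupied voxels receive an encoding.
    """
    assert dim % 6 == 0, "dim must be divisible by 6 (3 axes x sin/cos)"
    n_freq = dim // 6
    freqs = 1.0 / (10000.0 ** (np.arange(n_freq) / n_freq))
    angles = coords[:, :, None].astype(np.float64) * freqs  # (M, 3, n_freq)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(len(coords), dim)

# Usage: three points, two of which share a voxel.
pts = np.array([[0.001, 0.001, 0.001],
                [0.002, 0.002, 0.002],
                [0.011, 0.000, 0.000]])
coords, centroids = voxelize(pts)
pos = sparse_positional_encoding(coords, dim=12)
```

Because the encoding is computed from the voxel coordinates themselves, the number of tokens follows the number of occupied voxels rather than the size of a dense grid, which is the property that makes a transformer over the tokens tractable.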

Paper Structure (22 sections, 3 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Overview of RISE architecture. The input of RISE is a noisy point cloud captured from the real world. A 3D encoder built with sparse convolution is employed to compress the point cloud into tokens. The tokens are fed into the transformer encoder after adding sparse positional encoding. A readout token is used to query the action features from the transformer decoder. Conditioned on the action features, the Gaussian samples are denoised into continuous actions iteratively using a diffusion head.
  • Figure 2: Definition of the tasks in the experiments. During evaluation, each task is randomly initialized within the robot workspace. For each task, only 3 to 5 setups from the evaluations are depicted in the figure for clarity.
  • Figure 3: Experimental results of the pick-and-place tasks.
  • Figure 4: Failure cases of the Collect Pens task in the experiments.
  • Figure II: Evaluation metrics illustrations (left) and experimental results (right) of the push-to-goal tasks Push Block and Push Ball.
  • ...and 2 more figures
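
The last step in the Figure 1 caption, denoising Gaussian samples into continuous actions conditioned on the action features, can be sketched as a standard DDPM reverse loop. This is a generic sampler, not the RISE diffusion head: the linear noise schedule, the step count, the `horizon` and `action_dim` values, and the names `make_schedule`, `ddpm_sample`, and `oracle_eps` are assumptions for illustration. The noise predictor is passed in as a function, and the toy `oracle_eps` below stands in for a trained network by computing the noise exactly for a known target.

```python
import numpy as np

def make_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) and derived quantities."""
    betas = np.linspace(beta_start, beta_end, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def ddpm_sample(eps_model, cond, horizon=20, action_dim=10, steps=100, seed=0):
    """Denoise Gaussian samples into a (horizon, action_dim) action trajectory.

    eps_model(x, t, cond) must return the predicted noise with the shape of x.
    cond plays the role of the action features from the transformer.
    """
    betas, alphas, alpha_bars = make_schedule(steps)
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # start from pure noise
    for t in reversed(range(steps)):
        eps = eps_model(x, t, cond)
        # Posterior mean of the DDPM reverse step.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean  # no noise is added at the final step
    return x

# Toy stand-in for a trained network: it "knows" the clean trajectory
# (passed as cond) and returns the exact noise implied by x at step t.
def oracle_eps(x, t, cond, steps=100):
    _, _, alpha_bars = make_schedule(steps)
    return (x - np.sqrt(alpha_bars[t]) * cond) / np.sqrt(1.0 - alpha_bars[t])

target = np.zeros((20, 10))
actions = ddpm_sample(oracle_eps, target)
```

With the oracle predictor the final step returns the target trajectory up to floating-point error, which is a convenient sanity check for the loop. A real policy would replace `oracle_eps` with a learned network and would typically use far fewer sampling steps at inference time.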