PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation

Shizhe Chen, Ricardo Garcia, Cordelia Schmid, Ivan Laptev

TL;DR

The paper addresses language-guided robotic manipulation by leveraging 3D point clouds, overcoming the limitations of 2D image-based approaches. It introduces PolarNet, which uses a PointNext-based point cloud encoder, a CLIP language encoder, and a multimodal transformer to fuse modalities and predict 7-DoF actions via a heatmap-based position regression with per-point offsets, plus rotation and gripper state. Comprehensive RLBench experiments across single-task and multi-task settings show state-of-the-art performance and data efficiency, with additional real-robot demonstrations indicating practical applicability. Ablation studies highlight the importance of color, multi-view fusion, and careful point removal, and PolarNet demonstrates favorable training efficiency compared to voxel-based 3D methods.
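The heatmap-based position regression mentioned above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the heatmap is a softmax over per-point scores and that the predicted position is the heatmap-weighted sum of each point's coordinates plus its offset. The names `predict_position`, `logits`, and `offsets` are hypothetical.

```python
import math


def predict_position(points, logits, offsets):
    """Soft-argmax style position estimate over a point cloud.

    points  : list of (x, y, z) point coordinates
    logits  : list of per-point scores (assumed network output)
    offsets : list of (dx, dy, dz) per-point offsets (assumed network output)
    Returns the predicted (x, y, z) and the normalized heatmap.
    """
    if not points or not (len(points) == len(logits) == len(offsets)):
        raise ValueError("points, logits and offsets must be non-empty and equal in length")

    # Numerically stable softmax turns the scores into a heatmap that sums to 1.
    max_logit = max(logits)
    exps = [math.exp(score - max_logit) for score in logits]
    total = sum(exps)
    heatmap = [e / total for e in exps]

    # Weighted sum ("integral") of the offset-corrected point coordinates.
    position = [0.0, 0.0, 0.0]
    for weight, point, offset in zip(heatmap, points, offsets):
        for axis in range(3):
            position[axis] += weight * (point[axis] + offset[axis])
    return tuple(position), heatmap


if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    pos, hm = predict_position(pts, [0.0, 0.0], [(0.0, 0.0, 0.0)] * 2)
    print(pos, hm)
```

Under these assumptions the position stays differentiable with respect to the per-point scores, and the offsets let the prediction land between or away from the sampled points.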

Abstract

The ability of robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics. The dominant approaches for language-guided manipulation use 2D image representations, which face difficulties in combining multi-view cameras and inferring precise 3D positions and relationships. To address these limitations, we propose a 3D point-cloud-based policy called PolarNet for language-guided manipulation. It leverages carefully designed point cloud inputs, efficient point cloud encoders, and multimodal transformers to learn 3D point cloud representations and integrate them with language instructions for action prediction. PolarNet is shown to be effective and data efficient in a variety of experiments conducted on the RLBench benchmark. It outperforms state-of-the-art 2D and 3D approaches in both single-task and multi-task learning. It also achieves promising results on a real robot.

Paper Structure (19 sections, 6 equations, 13 figures, 13 tables)

Figures (13)

  • Figure 1: (a) Variations of the "Reach and Drag" task in RLBench [james20rlbench], with a different target color per variation. (b) Although different views are complementary in representing the scene, they are not explicitly aligned with each other. (c) Merging multi-view cameras constructs a unified point cloud in 3D space. Design options for the point cloud input are carefully investigated in this work.
  • Figure 2: PolarNet for language-guided manipulation. The approach takes as input the merged point cloud obtained from multi-view RGB-D images and a language instruction. It uses PointNext [qian22pointnext] for efficient point cloud encoding and the CLIP text encoder [radford21clip] for language encoding. The point cloud and language are integrated via a multi-layer transformer at the intermediate level. PolarNet predicts the position (cyan node) using an integral over the heatmap of the point cloud with a per-point offset, and predicts the rotation and open state of the gripper from global features.
  • Figure 3: Comparison of the robustness of different models to camera perturbation (lower is better).
  • Figure 4: Examples of the selected 10 tasks with corresponding instructions in RLBench.
  • Figure 5: Illustration of point cloud processing for the put knife task: the raw point cloud, the point cloud with the background removed, and the point cloud with both the background and the table removed.
  • ...and 8 more figures
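
The merging step described in the Figure 1(c) and Figure 2 captions can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: it assumes a pinhole camera model with known intrinsics (`fx`, `fy`, `cx`, `cy`) and a known 4x4 camera-to-world matrix per view, and the function names `backproject` and `merge_views` are hypothetical.

```python
def backproject(depth, fx, fy, cx, cy, cam_to_world):
    """Lift a depth image into world-frame 3D points.

    depth        : 2D list of depth values (rows of pixels); values <= 0 are skipped
    fx, fy       : focal lengths in pixels (assumed pinhole model)
    cx, cy       : principal point in pixels
    cam_to_world : 4x4 nested list mapping camera coordinates to world coordinates
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # no valid depth at this pixel
            # Pinhole back-projection into the camera frame (homogeneous coordinates).
            cam = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # Apply the first three rows of the camera-to-world transform.
            world = tuple(
                sum(cam_to_world[r][c] * cam[c] for c in range(4)) for r in range(3)
            )
            points.append(world)
    return points


def merge_views(views):
    """Concatenate the world-frame points of several camera views."""
    merged = []
    for view in views:
        merged.extend(backproject(**view))
    return merged


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    view = dict(depth=[[1.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0, cam_to_world=identity)
    print(merge_views([view]))
```

Because every view is expressed in the same world frame after the transform, the points from different cameras are aligned by construction, which is the property the 2D multi-view inputs in Figure 1(b) lack.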