PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation
Shizhe Chen, Ricardo Garcia, Cordelia Schmid, Ivan Laptev
TL;DR
The paper addresses language-guided robotic manipulation with 3D point clouds, targeting two weaknesses of 2D image-based approaches: combining multi-view cameras and inferring precise 3D positions. It introduces PolarNet, which combines a PointNeXt-based point cloud encoder, a CLIP language encoder, and a multimodal transformer that fuses the two modalities and predicts 7-DoF actions: position via a per-point heatmap with per-point offsets, plus rotation and gripper state. RLBench experiments in single-task and multi-task settings show state-of-the-art performance and data efficiency, and real-robot demonstrations indicate practical applicability. Ablation studies highlight the importance of color, multi-view fusion, and careful point removal, and PolarNet trains more efficiently than voxel-based 3D methods.
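To make the heatmap-based position regression concrete, here is a minimal NumPy sketch of one plausible reading of it: each point receives a heatmap score and a 3D offset, and the predicted gripper position is the heatmap-weighted average of the offset-corrected points. The function name, argument shapes, and the use of a softmax over points are assumptions made for illustration, not details confirmed by the summary.

```python
import numpy as np


def predict_position(points, heatmap_logits, offsets):
    """Hypothetical position head (illustrative only).

    points:         (N, 3) point coordinates
    heatmap_logits: (N,)   unnormalized per-point scores
    offsets:        (N, 3) per-point offset predictions
    Returns a (3,) position: softmax-weighted mean of points + offsets.
    """
    # Assumption: the heatmap is normalized with a softmax over all points.
    shifted = heatmap_logits - heatmap_logits.max()  # numerical stability
    weights = np.exp(shifted)
    weights /= weights.sum()
    # Each point votes for "its own location plus its predicted offset".
    candidates = points + offsets
    return (weights[:, None] * candidates).sum(axis=0)


points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
offsets = np.zeros((2, 3))
# Equal scores: the prediction is the midpoint of the two points.
position = predict_position(points, np.array([0.0, 0.0]), offsets)
```

The offsets let the prediction land between or slightly away from observed points, which a pure argmax over the heatmap could not do.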
Abstract
The ability for robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics. The dominant approaches for language-guided manipulation use 2D image representations, which face difficulties in combining multi-view cameras and inferring precise 3D positions and relationships. To address these limitations, we propose a 3D point cloud based policy called PolarNet for language-guided manipulation. It leverages carefully designed point cloud inputs, efficient point cloud encoders, and multimodal transformers to learn 3D point cloud representations and integrate them with language instructions for action prediction. PolarNet is shown to be effective and data efficient in a variety of experiments conducted on the RLBench benchmark. It outperforms state-of-the-art 2D and 3D approaches in both single-task and multi-task learning. It also achieves promising results on a real robot.
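As an illustration of why point clouds make multi-view fusion straightforward, the sketch below back-projects depth maps from several cameras into a shared world frame, concatenates them, and keeps only points inside a workspace box. It assumes a pinhole camera model, z-depth in metres, and known 4x4 camera-to-world matrices; the function names and the box-cropping step are hypothetical and stand in for whatever input design the paper actually uses.

```python
import numpy as np


def depth_to_world_points(depth, intrinsics, cam_to_world):
    """Back-project an (H, W) depth map into (H*W, 3) world-frame points.

    Assumes a pinhole model with a 3x3 intrinsics matrix and depth measured
    along the optical axis; cam_to_world is a 4x4 homogeneous transform.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column/row grids
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    cam_points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    homogeneous = np.concatenate([cam_points, np.ones((h * w, 1))], axis=1)
    return (homogeneous @ cam_to_world.T)[:, :3]


def merge_views(views, workspace_min, workspace_max):
    """Fuse (depth, intrinsics, cam_to_world) views into one cropped cloud."""
    clouds = [depth_to_world_points(d, k, t) for d, k, t in views]
    merged = np.concatenate(clouds, axis=0)
    # Hypothetical point-removal step: keep points inside an axis-aligned box.
    inside = np.all((merged >= workspace_min) & (merged <= workspace_max), axis=1)
    return merged[inside]
```

Because every camera's points end up in the same coordinate frame, adding a view is a concatenation rather than a learned cross-view alignment.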
