SUGAR: Pre-training 3D Visual Representations for Robotics
Shizhe Chen, Ricardo Garcia, Ivan Laptev, Cordelia Schmid
TL;DR
SUGAR is a 3D pre-training framework for robotics that uses a transformer-based architecture to learn semantic, geometric, and affordance properties from cluttered multi-object 3D point clouds. It jointly trains on five pre-training tasks (masked point modeling, cross-modal knowledge distillation, grasping pose synthesis, 3D instance segmentation, and referring expression grounding) using both single- and multi-object data, with curriculum learning guiding progression from simple to complex scenes. The learned 3D representations improve zero-shot 3D object recognition, referring expression grounding, and language-guided robotic manipulation, outperforming state-of-the-art 2D and 3D pre-trained models and demonstrating robustness in cluttered real-world scenarios. By emphasizing clutter and affordances, SUGAR offers practical benefits for robotics, including better manipulation in complex environments and more sample-efficient policy learning, while also highlighting the efficiency of large-scale 3D pre-training as an area for future improvement.
Abstract
Learning generalizable visual representations from Internet data has yielded promising results for robotics. Yet prevailing approaches focus on pre-training 2D representations, which are sub-optimal for dealing with occlusions and accurately localizing objects in complex 3D scenes. Meanwhile, 3D representation learning has been limited to single-object understanding. To address these limitations, we introduce a novel 3D pre-training framework for robotics named SUGAR that captures semantic, geometric and affordance properties of objects through 3D point clouds. We underscore the importance of cluttered scenes in 3D representation learning, and automatically construct a multi-object dataset that benefits from cost-free supervision in simulation. SUGAR employs a versatile transformer-based model to jointly address five pre-training tasks: cross-modal knowledge distillation for semantic learning, masked point modeling to understand geometric structure, grasping pose synthesis for object affordance, and 3D instance segmentation and referring expression grounding to analyze cluttered scenes. We evaluate our learned representation on three robotics-related tasks, namely zero-shot 3D object recognition, referring expression grounding, and language-driven robotic manipulation. Experimental results show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
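To make the idea of joint pre-training more concrete, the following is a minimal sketch of two ingredients mentioned above: masking part of a point cloud for masked point modeling, and combining several task losses into one objective. It is not the authors' implementation. The Chamfer distance as reconstruction loss, the weighted-sum combination, the function names (`split_masked`, `chamfer_distance`, `joint_loss`), and the task weights are all assumptions made for illustration.

```python
import random

def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two lists of (x, y, z) points.

    Assumed here as the reconstruction loss for masked point modeling.
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def one_way(src, dst):
        # Mean squared distance from each source point to its nearest target.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(pred, target) + one_way(target, pred)

def split_masked(points, mask_ratio, rng):
    """Randomly split a point cloud into (visible, masked) subsets."""
    idx = list(range(len(points)))
    rng.shuffle(idx)
    n_masked = int(len(points) * mask_ratio)
    masked = [points[i] for i in idx[:n_masked]]
    visible = [points[i] for i in idx[n_masked:]]
    return visible, masked

def joint_loss(task_losses, weights):
    """Weighted sum of per-task losses (the weighting scheme is an assumption)."""
    return sum(weights[name] * loss for name, loss in task_losses.items())

# Hypothetical per-task losses for one batch; the numbers are placeholders.
# Task names follow the five tasks listed in the abstract.
example_losses = {
    "masked_point_modeling": 0.5,
    "cross_modal_distillation": 0.25,
    "grasp_pose_synthesis": 1.0,
    "instance_segmentation": 0.75,
    "referring_expression_grounding": 0.5,
}
example_weights = {name: 1.0 for name in example_losses}
total = joint_loss(example_losses, example_weights)
```

In a real training loop, an encoder would see only the visible points, a decoder would predict the masked ones, and `chamfer_distance` would compare the prediction with the held-out points before the result enters `joint_loss` alongside the other four task losses.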
