SUGAR: Pre-training 3D Visual Representations for Robotics

Shizhe Chen, Ricardo Garcia, Ivan Laptev, Cordelia Schmid

TL;DR

SUGAR is a 3D pre-training framework for robotics that uses a transformer-based architecture to learn semantic, geometric, and affordance properties from cluttered multi-object 3D point clouds. It jointly trains five pre-training tasks (masked point modeling, cross-modal knowledge distillation, grasping pose synthesis, 3D instance segmentation, and referring expression grounding) on both single- and multi-object data, with curriculum learning guiding progression from simple to complex scenes. The learned 3D representations improve zero-shot 3D object recognition, referring expression grounding, and language-guided robotic manipulation, outperforming state-of-the-art 2D and 3D pre-trained models and demonstrating robustness in cluttered real-world scenarios. By emphasizing clutter and affordances, SUGAR offers practical improvements for robotics, including better manipulation in complex environments and more sample-efficient policy learning, while leaving room for future efficiency improvements in large-scale 3D pre-training.

Abstract

Learning generalizable visual representations from Internet data has yielded promising results for robotics. Yet, prevailing approaches focus on pre-training 2D representations, which are sub-optimal for dealing with occlusions and accurately localizing objects in complex 3D scenes. Meanwhile, 3D representation learning has been limited to single-object understanding. To address these limitations, we introduce a novel 3D pre-training framework for robotics named SUGAR that captures semantic, geometric and affordance properties of objects through 3D point clouds. We underscore the importance of cluttered scenes in 3D representation learning, and automatically construct a multi-object dataset benefiting from cost-free supervision in simulation. SUGAR employs a versatile transformer-based model to jointly address five pre-training tasks, namely cross-modal knowledge distillation for semantic learning, masked point modeling to understand geometric structures, grasping pose synthesis for object affordance, and 3D instance segmentation and referring expression grounding to analyze cluttered scenes. We evaluate our learned representation on three robotics-related tasks, namely, zero-shot 3D object recognition, referring expression grounding, and language-driven robotic manipulation. Experimental results show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.

Paper Structure (19 sections, 4 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: We introduce SUGAR, a pre-training framework for robotics-related tasks, which learns semantics, geometry and affordance on both single- and multi-object scenes.
  • Figure 2: Network architecture of SUGAR. It consists of a point cloud encoder to generate point embeddings and a prompt-based decoder that takes task-specific prompt tokens and layer-wise connections to point embeddings to obtain prompt embeddings.
  • Figure 3: Left: Five pre-training tasks for SUGAR using single- and multi-object scenes. The modules of the same color are shared. Right: The pre-trained point cloud encoder and prompt-based decoder are finetuned on the downstream task of robotic manipulation.
  • Figure 4: Referring expression examples on the OCID-Ref and RoboRefit datasets. The green bounding box is the ground-truth annotation, and the red bounding box is predicted by our SUGAR model. RoboRefit contains natural scenes and noisy depth observations.
  • Figure 5: Performance of training with 10 demonstrations.
  • ...and 4 more figures