GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping

Yuhang Zheng, Xiangyu Chen, Yupeng Zheng, Songen Gu, Runyi Yang, Bu Jin, Pengfei Li, Chengliang Zhong, Zengmao Wang, Lina Liu, Chao Yang, Dawei Wang, Zhen Chen, Xiaoxiao Long, Meiqing Wang

TL;DR

Open-world robotic manipulation requires grounding language in 3D scenes with dynamic objects. The authors introduce GaussianGrasper, which uses 3D Gaussian Splatting to model scenes explicitly and an Efficient Feature Distillation (EFD) module to distill open-vocabulary language features into a compact 3D feature field. A normal-guided grasp module and a scene-updating mechanism enable language-guided localization, feasible grasp generation, and rapid scene updates from few views. Real-world experiments demonstrate improved localization accuracy, faster language queries, robust geometry reconstruction, and higher manipulation success compared to baselines, highlighting practical impact for open-world robotics.

Abstract

Constructing a 3D scene capable of accommodating open-ended language queries is a pivotal pursuit, particularly within the domain of robotics. Such technology lets robots execute object manipulations based on human language directives. To tackle this challenge, some research efforts have been dedicated to the development of language-embedded implicit fields. However, implicit fields (e.g. NeRF) are limited by the need to process a large number of input views for reconstruction and by their inherently inefficient inference. Thus, we present GaussianGrasper, which utilizes 3D Gaussian Splatting to explicitly represent the scene as a collection of Gaussian primitives. Our approach takes a limited set of RGB-D views and employs a tile-based splatting technique to create a feature field. In particular, we propose an Efficient Feature Distillation (EFD) module that employs contrastive learning to efficiently and accurately distill language embeddings derived from foundation models. With the reconstructed geometry of the Gaussian field, our method enables the pre-trained grasping model to generate collision-free grasp pose candidates. Furthermore, we propose a normal-guided grasp module to select the best grasp pose. Through comprehensive real-world experiments, we demonstrate that GaussianGrasper enables robots to accurately query and grasp objects with language instructions, providing a new solution for language-guided manipulation tasks. Data and code are available at https://github.com/MrSecant/GaussianGrasper.

Paper Structure (28 sections, 9 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: We present a comparison between our method, 2D feature fusion, and LERF. When given the language query "hamburger", the features extracted by the 2D foundation models exhibit inconsistencies between two viewpoints, and LERF lacks clear segmentation boundaries. Consequently, they both suffer from imprecise 3D localization, as depicted by the yellow and purple 3D bounding boxes. In contrast, our method reconstructs a consistent feature field and achieves more precise 3D localization.
  • Figure 2: The architecture of our proposed method. (a) is our proposed pipeline, where we scan multi-view RGB-D images for initialization and reconstruct a 3D Gaussian field via feature distillation and geometry reconstruction. Subsequently, given a language instruction, we locate the target object via open-vocabulary querying. Grasp pose candidates for the target object are then generated by a pre-trained grasping model. Finally, a normal-guided module that uses surface normals to filter out infeasible candidates selects the best grasp pose. (b) elaborates on EFD, where we leverage contrastive learning to constrain the rendered latent feature $L$ and sample only a few pixels to recover features to the CLIP space via an MLP. The recovered features are then used to calculate the distillation loss against the CLIP features. (c) shows the normal-guided grasp, which uses force-closure theory to filter out infeasible grasp poses.
  • Figure 3: Relevance maps for the given language instructions. Our method exhibits clearer segmentation boundaries than LERF, which can be used to obtain more accurate localization. Compared with SAM + CLIP, our approach exhibits more consistent open-vocabulary features across multiple views. For instance, for 'Roasted chicken wing', SAM + CLIP responds to both the chicken wing and the fork, while our method makes the correct response.
  • Figure 4: Compared with the scanned depth and surface normals, our rendered depth and surface normals are smoother. Our method renders accurate depth and surface normals even in areas where the ground truth is invalid.
  • Figure 5: Effectiveness of our proposed normal-guided grasp. The left column shows the top 5 grasp proposals provided by AnyGrasp. The redder the color, the higher the grasping score. The middle column displays the surface normal of the object, with purple arrows indicating the normal of the contact points. The right column demonstrates the successful execution of grasping the knife utilizing the final grasp pose after filtering out unreasonable proposals.
  • ...and 1 more figures
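
The normal-guided filtering described in the Figure 2(c) and Figure 5 captions can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a two-finger gripper, a simple antipodal (friction-cone) test as the force-closure criterion, and hypothetical names (`antipodal_ok`, `select_grasp`, the `normal_a`/`normal_b`/`axis`/`score` fields, and the 20-degree threshold) that do not come from the paper.

```python
import numpy as np


def antipodal_ok(normal_a, normal_b, closing_axis, max_angle_deg=20.0):
    """Assumed force-closure test for a two-finger grasp.

    normal_a, normal_b: outward surface normals at the two contact points.
    closing_axis: vector pointing from contact A to contact B.
    The grasp passes if each outward normal lies within a cone of
    half-angle max_angle_deg around the direction opposing its finger.
    """
    n_a = np.asarray(normal_a, dtype=float)
    n_b = np.asarray(normal_b, dtype=float)
    axis = np.asarray(closing_axis, dtype=float)
    n_a = n_a / np.linalg.norm(n_a)
    n_b = n_b / np.linalg.norm(n_b)
    axis = axis / np.linalg.norm(axis)
    cos_limit = np.cos(np.deg2rad(max_angle_deg))
    # The finger at A pushes along +axis, so the outward normal there should
    # point along -axis; the finger at B pushes along -axis, so the outward
    # normal there should point along +axis.
    return bool(np.dot(-n_a, axis) >= cos_limit and np.dot(n_b, axis) >= cos_limit)


def select_grasp(candidates, max_angle_deg=20.0):
    """Drop candidates that fail the normal test, then keep the best score.

    candidates: list of dicts with hypothetical keys
    'normal_a', 'normal_b', 'axis', and 'score'.
    Returns the highest-scoring feasible candidate, or None if none pass.
    """
    feasible = [
        c for c in candidates
        if antipodal_ok(c["normal_a"], c["normal_b"], c["axis"], max_angle_deg)
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c["score"])
```

In this sketch the grasp model's score is consulted only after the geometric filter, so a high-scoring proposal whose contact normals are nearly perpendicular to the closing direction is discarded in favor of a lower-scoring but geometrically stable one.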