SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images

Junqiu Yu, Xinlin Ren, Yongchong Gu, Haitao Lin, Tianyu Wang, Yi Zhu, Hang Xu, Yu-Gang Jiang, Xiangyang Xue, Yanwei Fu

TL;DR

SparseGrasp tackles open-vocabulary language-guided robotic grasping in changeable environments using sparse-view RGB images. It combines 3D Gaussian Splatting with a dense initialization from DUSt3R, semantic features from MaskCLIP and SAM compressed via PCA, and a render-and-compare strategy for fast scene updates. Grasp generation is performed directly from the 3DGS representation, avoiding voxelization and depth back-projection. Empirical results on a Kinova Gen2 robot show faster scene updates and robust, language-driven grasping compared to state-of-the-art methods.
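
The summary does not spell out how render-and-compare works, so the following is only a minimal sketch of the general idea under stated assumptions: an image rendered from the current scene model is compared with a freshly captured image from the same viewpoint, and only the region that differs is flagged for updating. The function names, the per-pixel mean absolute difference, and the `threshold` value are illustrative choices, not details taken from the paper.

```python
import numpy as np

def change_mask(rendered, observed, threshold=0.1):
    """Boolean (H, W) mask of pixels whose mean absolute color difference exceeds threshold."""
    diff = np.abs(rendered.astype(np.float64) - observed.astype(np.float64)).mean(axis=-1)
    return diff > threshold

def bounding_box(mask):
    """Return (row_min, row_max, col_min, col_max) of changed pixels, or None if nothing changed."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    return int(rows.min()), int(rows.max()), int(cols.min()), int(cols.max())

# Toy data (assumption): a black rendering and an observation where one patch changed.
rendered = np.zeros((64, 64, 3))
observed = rendered.copy()
observed[20:30, 40:50] = 1.0

mask = change_mask(rendered, observed)
box = bounding_box(mask)
print(int(mask.sum()), box)
```

In a pipeline of this kind, the flagged region would presumably be the only part of the scene representation that needs re-optimizing, which is one plausible reason such updates can be faster than rebuilding the whole scene.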

Abstract

Language-guided robotic grasping is a rapidly advancing field where robots are instructed using human language to grasp specific objects. However, existing methods often depend on dense camera views and struggle to quickly update scenes, limiting their effectiveness in changeable environments. In contrast, we propose SparseGrasp, a novel open-vocabulary robotic grasping system that operates efficiently with sparse-view RGB images and handles scene updates quickly. Our system builds upon and significantly enhances existing computer vision modules in robotic learning. Specifically, SparseGrasp utilizes DUSt3R to generate a dense point cloud as the initialization for 3D Gaussian Splatting (3DGS), maintaining high fidelity even under sparse supervision. Importantly, SparseGrasp incorporates semantic awareness from recent vision foundation models. To further improve processing efficiency, we repurpose Principal Component Analysis (PCA) to compress features from 2D models. Additionally, we introduce a novel render-and-compare strategy that ensures rapid scene updates, enabling multi-turn grasping in changeable environments. Experimental results show that SparseGrasp significantly outperforms state-of-the-art methods in terms of both speed and adaptability, providing a robust solution for multi-turn grasping in changeable environments.
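
To make the PCA compression step concrete, here is a minimal NumPy sketch of compressing high-dimensional 2D-model features into a low-dimensional code and restoring them. The feature dimension (512), the target dimension (16), the synthetic data, and the helper names `fit_pca`, `compress`, and `decompress` are all assumptions for illustration; the abstract does not give these details.

```python
import numpy as np

def fit_pca(features, k):
    """Fit PCA on an (n, d) feature matrix; return the mean and the top-k components."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def compress(features, mean, components):
    """Project (n, d) features onto the k principal directions -> (n, k)."""
    return (features - mean) @ components.T

def decompress(codes, mean, components):
    """Map (n, k) codes back to the original d-dimensional space."""
    return codes @ components + mean

# Synthetic stand-in for 2D-model features (assumption): 200 vectors, 512-d,
# with low-rank structure plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 8))
basis = rng.normal(size=(8, 512))
features = latent @ basis + 0.01 * rng.normal(size=(200, 512))

mean, components = fit_pca(features, k=16)
codes = compress(features, mean, components)
restored = decompress(codes, mean, components)
rel_error = np.linalg.norm(restored - features) / np.linalg.norm(features)
print(codes.shape, rel_error)
```

Storing a short code per Gaussian instead of a full feature vector reduces memory and rendering cost. A text embedding could be projected with the same `mean` and `components` before comparison, though whether the paper does this is not stated here.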

Paper Structure

This paper contains 12 sections, 4 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Comparison between SparseGrasp and F3RM under both sparse-view and dense-view settings. The top row shows novel-view images, while the bottom row displays the heat map of the feature field for the text query "whisk". Using only 3 views, our method achieves performance comparable to F3RM trained with 17 views.
  • Figure 2: Our architecture. It starts by collecting sparse-view images and generating a dense point cloud to initialize 3DGS. Next, we combine FastSAM and MaskCLIP to compute the average feature within each mask. PCA then compresses these averaged features to a low dimension, and the compressed features are distilled into 3DGS. Given an open-vocabulary language instruction, our system locates the target object and generates appropriate grasp poses. When the scene changes, the Render-and-Compare strategy enables fast scene updates.
  • Figure 3: Comparison of dense point initialization vs. sparse point initialization with sparse-view images. Initializing with sparse points often leads to overfitting to the sparse-view images.
  • Figure 4: Effectiveness of our grasp model: unlike the original GraspNet, which fails to generate grasp poses from the 3DGS centers, our model successfully generates grasp poses.
  • Figure 5: Qualitative results of reconstruction and semantic distillation.
  • ...and 4 more figures
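
The captions for Figures 1 and 2 mention averaging features within each mask and querying the feature field with text to obtain a heat map. A minimal sketch of those two steps on toy data follows. The array shapes, the use of cosine similarity, and the function names are assumptions; real inputs would come from FastSAM masks and MaskCLIP features, which are not called here.

```python
import numpy as np

def mask_average_features(feature_map, masks):
    """Average an (H, W, D) feature map inside each non-empty boolean (H, W) mask -> (M, D)."""
    return np.stack([feature_map[m].mean(axis=0) for m in masks])

def relevance_map(masks, mask_features, query):
    """Paint each mask with the cosine similarity between its feature and the query vector."""
    feats = mask_features / np.linalg.norm(mask_features, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = feats @ q
    heat = np.zeros(masks[0].shape)
    for m, s in zip(masks, scores):
        heat[m] = s
    return heat, scores

# Toy scene (assumption): the left half carries one feature, the right half another.
feature_map = np.zeros((4, 4, 3))
feature_map[:, :2] = [1.0, 0.0, 0.0]
feature_map[:, 2:] = [0.0, 1.0, 0.0]
left = np.zeros((4, 4), dtype=bool)
left[:, :2] = True
masks = [left, ~left]

mask_features = mask_average_features(feature_map, masks)
# Stand-in for a text embedding that is closest to the left object's feature.
query = np.array([0.9, 0.1, 0.0])
heat, scores = relevance_map(masks, mask_features, query)
print(int(np.argmax(scores)))
```

Averaging within masks gives every pixel of an object the same feature, so a query tends to highlight whole objects rather than scattered pixels.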