SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images

Junqiu Yu; Xinlin Ren; Yongchong Gu; Haitao Lin; Tianyu Wang; Yi Zhu; Hang Xu; Yu-Gang Jiang; Xiangyang Xue; Yanwei Fu

TL;DR

SparseGrasp tackles open-vocabulary language-guided robotic grasping in changeable environments using sparse-view RGB images. It combines 3D Gaussian Splatting with a dense initialization from DUSt3R, semantic features from MaskCLIP and SAM compressed via PCA, and a render-and-compare strategy for fast scene updates. Grasp generation is performed directly from the 3DGS representation, avoiding voxelization and depth back-projection. Empirical results on a Kinova Gen2 robot show faster scene updates and robust, language-driven grasping compared to state-of-the-art methods.

Abstract

Language-guided robotic grasping is a rapidly advancing field where robots are instructed using human language to grasp specific objects. However, existing methods often depend on dense camera views and struggle to quickly update scenes, limiting their effectiveness in changeable environments. In contrast, we propose SparseGrasp, a novel open-vocabulary robotic grasping system that operates efficiently with sparse-view RGB images and handles scene updates quickly. Our system builds upon and significantly enhances existing computer vision modules in robotic learning. Specifically, SparseGrasp utilizes DUSt3R to generate a dense point cloud as the initialization for 3D Gaussian Splatting (3DGS), maintaining high fidelity even under sparse supervision. Importantly, SparseGrasp incorporates semantic awareness from recent vision foundation models. To further improve processing efficiency, we repurpose Principal Component Analysis (PCA) to compress features from 2D models. Additionally, we introduce a novel render-and-compare strategy that ensures rapid scene updates, enabling multi-turn grasping in changeable environments. Experimental results show that SparseGrasp significantly outperforms state-of-the-art methods in terms of both speed and adaptability, providing a robust solution for multi-turn grasping in changeable environments.
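To make the PCA compression step concrete, the following is a minimal sketch of compressing high-dimensional per-pixel features into a low-dimensional code and reconstructing them. It is not the authors' implementation: the function names (`fit_pca`, `compress`, `decompress`), the feature dimension of 512, the 16 retained components, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np


def fit_pca(features: np.ndarray, n_components: int):
    """Fit PCA on an (N, D) feature matrix.

    Returns the feature mean (D,) and the top principal directions
    as rows of an (n_components, D) matrix.
    """
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def compress(features, mean, components):
    """Project (N, D) features onto the principal directions -> (N, k)."""
    return (features - mean) @ components.T


def decompress(codes, mean, components):
    """Map (N, k) codes back to the original (N, D) feature space."""
    return codes @ components + mean


# Hypothetical stand-in for 2D foundation-model features: 1000 vectors of
# dimension 512 that lie close to a 16-dimensional subspace.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 16))
basis = rng.normal(size=(16, 512))
features = latent @ basis + 0.01 * rng.normal(size=(1000, 512))

mean, components = fit_pca(features, n_components=16)
codes = compress(features, mean, components)      # compact codes, one per feature
recon = decompress(codes, mean, components)       # approximate original features
rel_err = np.linalg.norm(recon - features) / np.linalg.norm(features)
```

In a pipeline of this kind, the compact codes (rather than the full feature vectors) would be the quantity attached to the 3D representation, with `decompress` applied only when features must be compared against a language embedding. Whether SparseGrasp follows exactly this arrangement is not stated in the abstract.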
