GraspSplats: Efficient Manipulation with 3D Feature Splatting

Mazeyu Ji, Ri-Zhao Qiu, Xueyan Zou, Xiaolong Wang

TL;DR

GraspSplats addresses the challenge of zero-shot, part-level grasping in dynamic environments by replacing implicit NeRF representations with explicit, feature-enhanced 3D Gaussians learned through depth supervision. The method efficiently constructs the scene, enables open-vocabulary object/part querying, and performs real-time tracking and edits to handle object displacement, achieving fast grasp proposals directly on Gaussian primitives. Its key contributions are (i) an efficient, depth-regularized 3D Gaussian construction with hierarchical reference features, (ii) native part-level querying and sampling for grasping, and (iii) real-time tracking and partial scene re-training to support dynamic manipulation. Experiments on a Franka robot show GraspSplats outperforms NeRF-based and 2D methods in both speed and accuracy, demonstrating practical potential for dynamic, articulated object interaction in robotics.

Abstract

The ability for robots to perform efficient and zero-shot grasping of object parts is crucial for practical applications and is becoming prevalent with recent advances in Vision-Language Models (VLMs). To bridge the 2D-to-3D gap for representations to support such a capability, existing methods rely on neural fields (NeRFs) via differentiable rendering or point-based projection methods. However, we demonstrate that NeRFs are inappropriate for scene changes due to their implicitness and point-based methods are inaccurate for part localization without rendering-based optimization. To amend these issues, we propose GraspSplats. Using depth supervision and a novel reference feature computation method, GraspSplats generates high-quality scene representations in under 60 seconds. We further validate the advantages of Gaussian-based representation by showing that the explicit and optimized geometry in GraspSplats is sufficient to natively support (1) real-time grasp sampling and (2) dynamic and articulated object manipulation with point trackers. With extensive experiments on a Franka robot, we demonstrate that GraspSplats significantly outperforms existing methods under diverse task settings. In particular, GraspSplats outperforms NeRF-based methods like F3RM and LERF-TOGO, and 2D detection methods.

Paper Structure (25 sections, 8 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: GraspSplats supports diverse robotics tasks using feature-enhanced 3D Gaussians. Compared to existing NeRF-based methods [shen2023-F3RM, rashid2023-LERFTOGO], GraspSplats transforms the feature representation to reflect object motions in real-time with point tracking from one or more cameras, which makes it possible to perform zero-shot dynamic and articulated object manipulation by parts.
  • Figure 2: GraspSplats employs two techniques to efficiently construct feature-enhanced 3D Gaussians: hierarchical feature extraction and dense initialization from geometry regularization, which reduces the overall runtime to 1/10 of existing GS methods [qiu2024-featuresplatting]. (High-dimensional features are visualized using PCA, and the visualized Gaussian ellipsoids are trained without densification.)
  • Figure 3: Given an initial state of Gaussians and RGB-D observations from one or more cameras, GraspSplats tracks the 3D motion of objects specified via language, which is used to deform the Gaussian representations in real-time. Given object-part text pairs, GraspSplats proposes grasping poses using both semantics and geometry of Gaussian primitives in milliseconds.
  • Figure 4: Qualitative examples of GraspSplats performing zero-shot task execution in real-world environments. Given object-part text queries (italicized in the task description), GraspSplats executes grasping followed by heuristic trajectories. From left to right in each row: illustration of the scene change; grasp poses sampled by GraspSplats; execution of grasping. Animated visualizations can be found on the website.
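
The Figure 3 caption describes using tracked 3D motion to deform the Gaussian representation. As a rough illustration of what an explicit representation makes possible, the sketch below fits a rigid transform to tracked 3D keypoints and applies it to Gaussian means and covariances. It is not the authors' implementation: the Kabsch least-squares fit, the function names `estimate_rigid_transform` and `move_gaussians`, and the array layouts are all assumptions chosen for illustration.

```python
import numpy as np


def estimate_rigid_transform(src, dst):
    """Least-squares rigid fit (Kabsch): find R, t with dst ~= src @ R.T + t.

    src, dst: (N, 3) arrays of corresponding 3D keypoints, e.g. tracked
    points on one object before and after it moves (assumed input format).
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def move_gaussians(means, covs, R, t):
    """Apply a rigid transform to Gaussian centers and covariances.

    means: (N, 3) centers; covs: (N, 3, 3) covariance matrices.
    """
    new_means = means @ R.T + t
    new_covs = R @ covs @ R.T  # broadcasts over the leading N axis
    return new_means, new_covs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keypoints_before = rng.normal(size=(20, 3))
    angle = 0.3
    R_true = np.array([
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle), np.cos(angle), 0.0],
        [0.0, 0.0, 1.0],
    ])
    t_true = np.array([0.1, -0.2, 0.05])
    keypoints_after = keypoints_before @ R_true.T + t_true

    R, t = estimate_rigid_transform(keypoints_before, keypoints_after)
    print("rotation recovered:", np.allclose(R, R_true))
    print("translation recovered:", np.allclose(t, t_true))
```

Because the Gaussians are explicit primitives, only the subset belonging to the moved object would need this update, and the rest of the scene stays untouched. An implicit field offers no comparably direct edit.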