Splat-MOVER: Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting

Ola Shorinwa, Johnathan Tucker, Aliyah Smith, Aiden Swann, Timothy Chen, Roya Firoozi, Monroe Kennedy, Mac Schwager

TL;DR

Splat-MOVER advances open-vocabulary robotic manipulation by embedding semantic and grasp-affordance knowledge directly into a 3D Gaussian Splat scene (ASK-Splat), enabling real-time scene editing (SEE-Splat) and affordance-aware grasp generation (Grasp-Splat) for multi-stage tasks. The approach combines a brief RGB-based scanning phase with CLIP and affordance distillation to create a dynamic digital twin that reflects object motions across stages, improving planning and execution without human demonstrations. Hardware experiments on a Kinova robot show significant gains over two recent baselines across single- and multi-stage tasks, demonstrating tangible benefits of scene editing and affordance grounding for robust manipulation. The work suggests a practical path toward scalable, language-guided robotics by blending lightweight semantic distillation with fast, editable 3D representations, while outlining limitations related to generalization and extending affordances to SE(3).

Abstract

We present Splat-MOVER, a modular robotics stack for open-vocabulary robotic manipulation, which leverages the editability of Gaussian Splatting (GSplat) scene representations to enable multi-stage manipulation tasks. Splat-MOVER consists of: (i) ASK-Splat, a GSplat representation that distills semantic and grasp affordance features into the 3D scene, enabling the geometric, semantic, and affordance understanding of 3D scenes that is critical in many robotics tasks; (ii) SEE-Splat, a real-time scene-editing module that uses 3D semantic masking and infilling to visualize the motions of objects that result from robot interactions in the real world, creating a "digital twin" of the evolving environment throughout the manipulation task; and (iii) Grasp-Splat, a grasp generation module that uses ASK-Splat and SEE-Splat to propose affordance-aligned candidate grasps for open-world objects. ASK-Splat is trained in real time from RGB images in a brief scanning phase prior to operation, while SEE-Splat and Grasp-Splat run in real time during operation. In hardware experiments on a Kinova robot, Splat-MOVER outperforms two recent baselines in four single-stage, open-vocabulary manipulation tasks and in four multi-stage manipulation tasks, using the edited scene to reflect changes due to prior manipulation stages, which is not possible with existing baselines. Video demonstrations and the code for the project are available at https://splatmover.github.io.

Paper Structure

This paper contains 29 sections, 3 equations, 14 figures, and 2 tables.

Figures (14)

  • Figure 1: Splat-MOVER enables language-guided, multi-stage robotic manipulation, through an affordance-and-semantic-aware scene representation (ASK-Splat), a real-time scene-editing module (SEE-Splat), and a grasp-generation module (Grasp-Splat).
  • Figure 2: ASK-Splat grounds 2D visual attributes (e.g., color and lighting effects), grasp affordance, and semantic embeddings within a 3D GSplat representation and is trained entirely from RGB images. Using the 3D ASK-Splat, SEE-Splat enables open-vocabulary scene-editing via semantic localization of Gaussian primitives in the scene, followed by 3D masking and transformation $\xi(t)$ of these Gaussians.
  • Figure 3: Scene-editing introduces artifacts, e.g., holes in the table (center) after moving the saucepan, which are removed via 3D Gaussian infilling (right).
  • Figure 4: The top-two grasps proposed by (left) GraspNet and (right) Grasp-Splat for a saucepan.
  • Figure 5: Grasp-Splat generates affordance-aligned grasps using the semantic and grasp-affordance knowledge in ASK-Splat. Qualitatively, the proposed grasps lie in regions where a human is more likely to grasp (e.g., the handle of the saucepot and the center of the fruit) and are more likely to result in stable grasps when executed by a robot.
  • ...and 9 more figures
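
The scene-editing step described in the Figure 2 caption (semantic localization of Gaussian primitives, then 3D masking and transformation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cosine-similarity threshold, and the restriction to a single fixed yaw-plus-translation transform (rather than a time-varying $\xi(t)$) are all assumptions made for illustration, and the embeddings are toy 2D vectors rather than CLIP features.

```python
import math
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def cosine_similarity(a: Vec, b: Vec) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def semantic_mask(embeddings: Sequence[Vec], query: Vec, threshold: float) -> List[bool]:
    """Hypothetical localization: select Gaussians whose embedding matches the query."""
    return [cosine_similarity(e, query) >= threshold for e in embeddings]


def rigid_transform_z(point: Vec, yaw: float, translation: Vec) -> Tuple[float, float, float]:
    """Rotate a 3D point about the z-axis by `yaw` radians, then translate it."""
    c, s = math.cos(yaw), math.sin(yaw)
    x, y, z = point
    return (
        c * x - s * y + translation[0],
        s * x + c * y + translation[1],
        z + translation[2],
    )


def edit_scene(
    means: Sequence[Vec],
    embeddings: Sequence[Vec],
    query: Vec,
    threshold: float,
    yaw: float,
    translation: Vec,
) -> List[Tuple[float, ...]]:
    """Move only the masked Gaussian means; all other means are left unchanged.

    Assumption: a full edit would also rotate each Gaussian's orientation and
    covariance; this sketch transforms the means only.
    """
    mask = semantic_mask(embeddings, query, threshold)
    return [
        rigid_transform_z(p, yaw, translation) if selected else tuple(p)
        for p, selected in zip(means, mask)
    ]


# Toy scene: two Gaussians, the first of which matches the query embedding.
means = [(1.0, 0.0, 0.5), (0.0, 2.0, 0.0)]
embeddings = [(1.0, 0.0), (0.0, 1.0)]
edited = edit_scene(means, embeddings, (0.9, 0.1), 0.8, math.pi / 2, (0.0, 0.0, 0.1))
print(edited)
```

In this toy scene only the first Gaussian passes the similarity threshold, so it alone is rotated and lifted while the second stays in place. Moving the selected Gaussians leaves their former location empty, which corresponds to the holes that Figure 3 shows being removed by infilling.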