Splat-MOVER: Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting

Ola Shorinwa; Johnathan Tucker; Aliyah Smith; Aiden Swann; Timothy Chen; Roya Firoozi; Monroe Kennedy; Mac Schwager

TL;DR

Splat-MOVER advances open-vocabulary robotic manipulation by embedding semantic and grasp-affordance knowledge directly into a 3D Gaussian Splat scene (ASK-Splat), enabling real-time scene editing (SEE-Splat) and affordance-aware grasp generation (Grasp-Splat) for multi-stage tasks. The approach combines a brief RGB-based scanning phase with CLIP and affordance distillation to create a dynamic digital twin that reflects object motions across stages, improving planning and execution without human demonstrations. Hardware experiments on a Kinova robot show significant gains over two recent baselines across single- and multi-stage tasks, demonstrating tangible benefits of scene editing and affordance grounding for robust manipulation. The work suggests a practical path toward scalable, language-guided robotics by blending lightweight semantic distillation with fast, editable 3D representations, while outlining limitations related to generalization and extending affordances to SE(3).

Abstract

We present Splat-MOVER, a modular robotics stack for open-vocabulary robotic manipulation, which leverages the editability of Gaussian Splatting (GSplat) scene representations to enable multi-stage manipulation tasks. Splat-MOVER consists of: (i) ASK-Splat, a GSplat representation that distills semantic and grasp affordance features into the 3D scene, enabling the geometric, semantic, and affordance understanding of 3D scenes that is critical in many robotics tasks; (ii) SEE-Splat, a real-time scene-editing module that uses 3D semantic masking and infilling to visualize the motions of objects that result from robot interactions in the real world, creating a "digital twin" of the evolving environment throughout the manipulation task; and (iii) Grasp-Splat, a grasp generation module that uses ASK-Splat and SEE-Splat to propose affordance-aligned candidate grasps for open-world objects. ASK-Splat is trained in real time from RGB images in a brief scanning phase prior to operation, while SEE-Splat and Grasp-Splat run in real time during operation. We demonstrate the superior performance of Splat-MOVER in hardware experiments on a Kinova robot compared to two recent baselines in four single-stage, open-vocabulary manipulation tasks and in four multi-stage manipulation tasks. In the multi-stage tasks, Splat-MOVER uses the edited scene to reflect changes due to prior manipulation stages, which is not possible with existing baselines. Video demonstrations and the code for the project are available at https://splatmover.github.io.
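
To make the scene-editing idea concrete, the following is a minimal sketch of semantic masking followed by a rigid transformation of the selected Gaussians. It is not the authors' implementation: the function names `semantic_mask` and `apply_rigid_transform`, the cosine-similarity threshold, and the toy 2D embeddings are all assumptions for illustration, and infilling is omitted.

```python
import numpy as np

def semantic_mask(gaussian_embeds, text_embed, threshold=0.5):
    """Select Gaussians whose embedding is similar to the query embedding.

    Hypothetical helper: uses cosine similarity with a fixed threshold.
    """
    g = gaussian_embeds / np.linalg.norm(gaussian_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    return (g @ t) > threshold

def apply_rigid_transform(means, mask, rotation, translation):
    """Move the masked Gaussian centers by x -> R x + t, leaving the rest fixed.

    Simplification: only the centers are moved. A full edit would also
    rotate each selected Gaussian's orientation (covariance).
    """
    edited = means.copy()
    edited[mask] = means[mask] @ rotation.T + translation
    return edited

# Toy scene: three Gaussian centers with 2D stand-in semantic embeddings.
means = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
embeds = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.0])  # stand-in for a text-query embedding

mask = semantic_mask(embeds, query, threshold=0.8)

# Rotate the selected object 90 degrees about z and lift it by 0.5.
rotation = np.array([[0.0, -1.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0]])
translation = np.array([0.0, 0.0, 0.5])
edited = apply_rigid_transform(means, mask, rotation, translation)
print(mask)
print(edited)
```

In this toy scene the first two Gaussians match the query and are moved, while the third keeps its original position.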

Paper Structure (29 sections, 3 equations, 14 figures, 2 tables)

Introduction
Preliminaries
Affordance-and-Semantic-Knowledge Gaussian Splatting
Scene-Editing-Enabled Gaussian Splatting
Affordance-Aligned Grasp Generation
Experiments
Related Work
Conclusion
Limitations and Future Work
Affordance-and-Semantic-Knowledge Gaussian Splatting
Grounding Language Semantics in 3D Gaussian Splatting
Grounding Affordance in 3D Gaussian Splatting
Scene-Editing-Enabled Gaussian Splatting
Editing the Gaussians in SEE-Splat
Grasping and Manipulation with Splat-MOVER
...and 14 more sections

Figures (14)

Figure 1: Splat-MOVER enables language-guided, multi-stage robotic manipulation through an affordance-and-semantic-aware scene representation (ASK-Splat), a real-time scene-editing module (SEE-Splat), and a grasp-generation module (Grasp-Splat).
Figure 2: ASK-Splat grounds 2D visual attributes (e.g., color and lighting effects), grasp affordance, and semantic embeddings within a 3D GSplat representation and is trained entirely from RGB images. Using 3D ASK-Splat, SEE-Splat enables open-vocabulary scene editing via semantic localization of Gaussian primitives in the scene, followed by 3D masking and transformation $\xi(t)$ of these Gaussians.
Figure 3: Scene editing introduces artifacts, e.g., holes in the table (center) after moving the saucepan, which are removed via 3D Gaussian infilling (right).
Figure 4: The top-two grasps proposed by (left) GraspNet and (right) Grasp-Splat for a saucepan.
Figure 5: Grasp-Splat generates affordance-aligned grasps using the semantic and grasp-affordance knowledge in ASK-Splat. Qualitatively, the proposed grasps lie in regions where a human is more likely to grasp (e.g., the handle of the saucepot and the center of the fruit) and are more likely to result in stable grasps when executed by a robot.
...and 9 more figures
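
As a rough illustration of what "affordance-aligned" grasp selection could look like, the sketch below re-ranks candidate grasp positions by the mean affordance score of nearby Gaussians. This is an assumed scoring rule, not the method described in the paper: the function `rank_grasps`, the `radius` parameter, and the toy scores are hypothetical.

```python
import numpy as np

def rank_grasps(grasp_positions, gaussian_means, affordance, radius=0.05):
    """Rank candidate grasps by the mean affordance of Gaussians within `radius`.

    Hypothetical scoring rule: grasps with no Gaussians nearby score 0.
    Returns (indices sorted best-first, per-grasp scores).
    """
    scores = np.zeros(len(grasp_positions))
    for i, p in enumerate(grasp_positions):
        near = np.linalg.norm(gaussian_means - p, axis=1) <= radius
        if near.any():
            scores[i] = affordance[near].mean()
    order = np.argsort(-scores, kind="stable")
    return order, scores

# Toy data: three Gaussians with per-Gaussian affordance scores in [0, 1].
gaussian_means = np.array([[0.0, 0.0, 0.0], [0.02, 0.0, 0.0], [0.3, 0.0, 0.0]])
affordance = np.array([0.2, 0.4, 0.9])

# Three candidate grasp positions, e.g. from an off-the-shelf grasp proposer.
grasps = np.array([[0.01, 0.0, 0.0], [0.3, 0.0, 0.01], [1.0, 1.0, 1.0]])

order, scores = rank_grasps(grasps, gaussian_means, affordance)
print(order, scores)
```

Here the second candidate sits next to the high-affordance Gaussian and is ranked first, while the candidate far from any Gaussian is ranked last.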
