Splat-MOVER: Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting
Ola Shorinwa, Johnathan Tucker, Aliyah Smith, Aiden Swann, Timothy Chen, Roya Firoozi, Monroe Kennedy, Mac Schwager
TL;DR
Splat-MOVER performs open-vocabulary robotic manipulation by embedding semantic and grasp-affordance knowledge directly into a 3D Gaussian Splat scene (ASK-Splat), which enables real-time scene editing (SEE-Splat) and affordance-aware grasp generation (Grasp-Splat) for multi-stage tasks. The approach combines a brief RGB-based scanning phase with CLIP and affordance distillation to create a digital twin that is updated to reflect object motions across stages, supporting planning and execution without human demonstrations. In hardware experiments on a Kinova robot, Splat-MOVER outperforms two recent baselines on four single-stage and four multi-stage tasks, showing the benefit of scene editing and affordance grounding for manipulation. The work suggests a practical path toward language-guided robotics that pairs lightweight semantic distillation with fast, editable 3D representations, and it outlines limitations related to generalization and to extending affordances to SE(3).
Abstract
We present Splat-MOVER, a modular robotics stack for open-vocabulary robotic manipulation, which leverages the editability of Gaussian Splatting (GSplat) scene representations to enable multi-stage manipulation tasks. Splat-MOVER consists of: (i) ASK-Splat, a GSplat representation that distills semantic and grasp affordance features into the 3D scene, enabling the geometric, semantic, and affordance understanding of 3D scenes that is critical in many robotics tasks; (ii) SEE-Splat, a real-time scene-editing module that uses 3D semantic masking and infilling to visualize the motions of objects that result from robot interactions in the real world, creating a "digital twin" of the evolving environment throughout the manipulation task; and (iii) Grasp-Splat, a grasp generation module that uses ASK-Splat and SEE-Splat to propose affordance-aligned candidate grasps for open-world objects. ASK-Splat is trained in real time from RGB images in a brief scanning phase prior to operation, while SEE-Splat and Grasp-Splat run in real time during operation. In hardware experiments on a Kinova robot, Splat-MOVER outperforms two recent baselines in four single-stage, open-vocabulary manipulation tasks and in four multi-stage manipulation tasks. In the multi-stage tasks, it uses the edited scene to reflect changes due to prior manipulation stages, which is not possible with existing baselines. Video demonstrations and the code for the project are available at https://splatmover.github.io.
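
To make the idea of 3D semantic masking followed by scene editing more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each Gaussian carries a 3D mean and a semantic feature vector, selects the Gaussians whose feature is close (by cosine similarity) to a query embedding, and moves only those. The function names, the threshold of 0.8, the 2D toy features, and the use of a pure translation in place of a full rigid transform are all illustrative assumptions; infilling is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def semantic_mask(features, query, threshold=0.8):
    """Indices of Gaussians whose feature matches the query (hypothetical threshold)."""
    return [i for i, f in enumerate(features) if cosine(f, query) >= threshold]


def translate_masked(means, mask, offset):
    """Return a copy of the means with only the masked Gaussians shifted by offset."""
    selected = set(mask)
    return [
        tuple(m + d for m, d in zip(mean, offset)) if i in selected else tuple(mean)
        for i, mean in enumerate(means)
    ]


# Toy scene: three Gaussians with 3D means and 2D stand-in semantic features.
means = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
features = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
query = (1.0, 0.0)  # hypothetical embedding of the text prompt for the object

mask = semantic_mask(features, query)
edited = translate_masked(means, mask, offset=(0.0, 0.0, 0.5))
```

In this toy scene the first two Gaussians match the query and are lifted by 0.5 along z, while the third is left in place and the original `means` list is not modified. A real system would use high-dimensional distilled features and would also update orientations and fill in the region the object vacated.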
