ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation

Guanxing Lu, Shiyi Zhang, Ziwei Wang, Changliu Liu, Jiwen Lu, Yansong Tang

TL;DR

ManiGaussian tackles language-conditioned robotic manipulation in unstructured environments by explicitly modeling scene-level spatiotemporal dynamics. It introduces a dynamic Gaussian Splatting framework that propagates semantic features in a Gaussian embedding space, and couples it with a Gaussian world model that reconstructs future scenes for supervision. On 10 RLBench tasks with 166 variations, ManiGaussian outperforms state-of-the-art methods by 13.1% in average success rate and trains more quickly. The work highlights the value of explicit dynamic scene understanding for robust, goal-directed manipulation under natural language guidance.

Abstract

Language-conditioned robotic manipulation in unstructured environments is a highly sought-after capability for general-purpose intelligent robots. Conventional robotic manipulation methods usually learn a semantic representation of the observation for action prediction, which ignores the scene-level spatiotemporal dynamics needed to complete human goals. In this paper, we propose ManiGaussian, a dynamic Gaussian Splatting method for multi-task robotic manipulation that mines scene dynamics via future scene reconstruction. Specifically, we first formulate a dynamic Gaussian Splatting framework that infers semantics propagation in the Gaussian embedding space, where the semantic representation is leveraged to predict the optimal robot action. Then, we build a Gaussian world model to parameterize the distribution in our dynamic Gaussian Splatting framework, which provides informative supervision in the interactive environment via future scene reconstruction. We evaluate ManiGaussian on 10 RLBench tasks with 166 variations, and the results demonstrate that our framework outperforms state-of-the-art methods by 13.1% in average success rate. Project page: https://guanxinglu.github.io/ManiGaussian/.
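
As a rough illustration of how future scene reconstruction could act as supervision alongside action prediction, the sketch below combines an action term with current-scene and future-scene reconstruction terms. It is not the authors' implementation: the function names, the use of mean squared error for every term, and the `recon_weight` value are all assumptions made for illustration, and images are represented as flat lists of pixel values to keep the example dependency-free.

```python
from typing import Dict, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally sized, non-empty sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally sized")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(
    action_pred: Sequence[float],
    action_target: Sequence[float],
    render_now: Sequence[float],
    image_now: Sequence[float],
    render_future: Sequence[float],
    image_future: Sequence[float],
    recon_weight: float = 0.1,  # hypothetical weighting, not from the paper
) -> Dict[str, float]:
    """Hypothetical objective: action term plus weighted reconstruction terms.

    render_now / render_future stand in for images rendered from the current
    and the predicted future Gaussian scene; image_now / image_future are the
    corresponding observations. All terms use MSE purely as an assumption.
    """
    action_term = mse(action_pred, action_target)
    current_term = mse(render_now, image_now)
    future_term = mse(render_future, image_future)
    total = action_term + recon_weight * (current_term + future_term)
    return {
        "action": action_term,
        "current": current_term,
        "future": future_term,
        "total": total,
    }


losses = combined_loss(
    action_pred=[0.0, 0.0], action_target=[1.0, 1.0],
    render_now=[0.0, 0.0, 0.0, 0.0], image_now=[1.0, 1.0, 1.0, 1.0],
    render_future=[0.5, 0.5], image_future=[0.5, 0.5],
)
print(losses)
```

In this toy setup the future-scene term penalizes the model whenever the scene it predicts after an action differs from what is later observed, which is the sense in which reconstruction supplies a dynamics signal in addition to the action loss.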

Paper Structure (17 sections, 9 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Consider the human instruction "stack two rose blocks", where the task is considered successful if two rose blocks are stacked upon the green block. The previous method, GNFactor (Ze et al., 2023), attempts to pick up the fixed green base and fails because it misunderstands the scene dynamics, while our ManiGaussian completes the task successfully by explicitly encoding the scene dynamics via future scene reconstruction in Gaussian embedding space.
  • Figure 2: The overall pipeline of ManiGaussian, which primarily consists of a dynamic Gaussian Splatting framework and a Gaussian world model. The dynamic Gaussian Splatting framework models the propagation of diverse semantic features in the Gaussian embedding space for manipulation, and the Gaussian world model parameterizes distributions to provide supervision by reconstructing the future scene for scene-level dynamics mining.
  • Figure 3: Learning Curve. Comparison of our ManiGaussian with GNFactor in performance and speed. For a fair comparison, we exclude auxiliary losses from the reconstruction loss. The grey dotted lines represent the results using a moving average.
  • Figure 4: Case Study. A red mark indicates that the pose deviates severely from the expert demonstration, whereas a green mark indicates that the pose aligns with the expert trajectory. Our ManiGaussian successfully completes the human goal thanks to its physical understanding of scene-level spatiotemporal dynamics.
  • Figure 5: Novel View Synthesis Results. We remove the action loss here for better visualization. Our ManiGaussian is capable of both current scene reconstruction and future scene prediction.