Novel Demonstration Generation with Gaussian Splatting Enables Robust One-Shot Manipulation

Sizhe Yang, Wenye Yu, Jia Zeng, Jun Lv, Kerui Ren, Cewu Lu, Dahua Lin, Jiangmiao Pang

TL;DR

The paper addresses the limited robustness of visuomotor imitation policies by moving beyond 2D data augmentation. It presents RoboSplat, a 3D Gaussian Splatting-based pipeline that reconstructs the scene from a single expert demonstration and multi-view images, then generates diverse, visually realistic demonstrations across six types of generalization. In real-world experiments, RoboSplat reaches an 87.8% average success rate in one-shot settings, compared with 57.2% for policies trained on hundreds of real-world demonstrations with additional 2D data augmentation, and it also shows strong data efficiency and cross-embodiment transfer. Limitations include handling deformable objects and integrating physical constraints, which suggests future work on physics-informed 3D Gaussians.

Abstract

Visuomotor policies learned from teleoperated demonstrations face challenges such as lengthy data collection, high costs, and limited data diversity. Existing approaches address these issues by augmenting image observations in RGB space or employing Real-to-Sim-to-Real pipelines based on physical simulators. However, the former is constrained to 2D data augmentation, while the latter suffers from imprecise physical simulation caused by inaccurate geometric reconstruction. This paper introduces RoboSplat, a novel method that generates diverse, visually realistic demonstrations by directly manipulating 3D Gaussians. Specifically, we reconstruct the scene through 3D Gaussian Splatting (3DGS), directly edit the reconstructed scene, and augment data across six types of generalization with five techniques: 3D Gaussian replacement for varying object types, scene appearance, and robot embodiments; equivariant transformations for different object poses; visual attribute editing for various lighting conditions; novel view synthesis for new camera perspectives; and 3D content generation for diverse object types. Comprehensive real-world experiments demonstrate that RoboSplat significantly enhances the generalization of visuomotor policies under diverse disturbances. Notably, while policies trained on hundreds of real-world demonstrations with additional 2D data augmentation achieve an average success rate of 57.2%, RoboSplat attains 87.8% in one-shot settings across six types of generalization in the real world.
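
As a rough illustration of the equivariant-transformation idea mentioned in the abstract, the sketch below moves an object and an end-effector waypoint by the same rigid transform, so the gripper-to-object relationship from the source demonstration is preserved. Everything here is an assumption made for illustration: the function names, the restriction to a yaw rotation about the world z-axis plus a translation, and the toy coordinates are hypothetical and not taken from the paper. A full 3DGS edit would also need to rotate each Gaussian's orientation, which is omitted here.

```python
import math

def move_point(point, yaw, shift):
    """Rotate a 3D point about the world z-axis by `yaw`, then translate by `shift`."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])

def move_waypoint(waypoint, yaw, shift):
    """Apply the same transform to an end-effector waypoint (x, y, z, heading)."""
    x, y, z, heading = waypoint
    nx, ny, nz = move_point((x, y, z), yaw, shift)
    return (nx, ny, nz, heading + yaw)

def offset_in_gripper_frame(waypoint, point):
    """Express `point` relative to the gripper position and heading."""
    x, y, z, heading = waypoint
    dx, dy, dz = point[0] - x, point[1] - y, point[2] - z
    c, s = math.cos(heading), math.sin(heading)
    # Multiply the offset by the transpose of the heading rotation.
    return (c * dx + s * dy, -s * dx + c * dy, dz)

# Toy stand-ins (hypothetical values): Gaussian centers of one object
# and a single grasp keyframe from the source demonstration.
object_centers = [(0.50, 0.00, 0.05), (0.52, 0.01, 0.05), (0.49, -0.02, 0.07)]
grasp_waypoint = (0.50, 0.00, 0.15, 0.0)

# A new object pose: 30 degrees of yaw plus a shift on the table plane.
yaw, shift = math.radians(30.0), (0.10, -0.05, 0.0)
new_centers = [move_point(p, yaw, shift) for p in object_centers]
new_waypoint = move_waypoint(grasp_waypoint, yaw, shift)

# The object looks the same from the gripper's point of view before and after.
before = [offset_in_gripper_frame(grasp_waypoint, p) for p in object_centers]
after = [offset_in_gripper_frame(new_waypoint, p) for p in new_centers]
```

Because the object and the waypoint receive the same transform, `before` and `after` agree up to floating-point error, which is the property that lets one demonstration be reused for a new object pose.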

Paper Structure

This paper contains 35 sections, 7 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Starting from a single expert demonstration and multi-view images, our method generates diverse and visually realistic data for policy learning, enabling robust performance across six types of generalization in the real world. Compared to previous 2D data augmentation methods, our approach achieves significantly better results across various generalization types. Notably, we achieve this within a unified framework.
  • Figure 2: Method overview. We start from a single manually collected demonstration and multi-view images that capture the whole scene. The former provides task-related keyframes, while the latter helps scene reconstruction. After aligning the reconstructed frame with the real-world frame and segmenting different scene components, we carry out autonomous editing of the scene in pursuit of six types of augmentation.
  • Figure 3: Comparison of frame alignment results between ICP and fine-grained optimization with differentiable rendering. The semi-transparent orange overlay represents the ground truth rendered with URDF from the same camera view. The left shows the results of ICP, which have larger errors, while the right shows the results after further fine-grained optimization using differentiable rendering.
  • Figure 4: Illustration of frame alignment with differentiable rendering. The loss is calculated between the mask rendered using Gaussian Splatting and the mask rendered with URDF. Subsequently, backpropagation and gradient descent are used to optimize the translation, rotation, and scale, which are then applied to the 3D Gaussians.
  • Figure 5: Real-world experiment setup. We employ a Franka Research 3 Robot and two eye-on-base RealSense D435i cameras.
  • ...and 10 more figures
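
The optimization loop described in the Figure 4 caption (gradient descent over translation, rotation, and scale) can be illustrated with a much simpler stand-in. The sketch below is not the paper's method: it swaps the rendered-mask loss for a squared error between hypothetical 2D point correspondences, and it uses hand-derived gradients instead of differentiable rendering. All names, the learning rate, the step count, and the toy points are assumptions.

```python
import math

def apply_similarity(params, point):
    """Scale, rotate, then translate a 2D point; params = (scale, theta, tx, ty)."""
    scale, theta, tx, ty = params
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (scale * (c * x - s * y) + tx, scale * (s * x + c * y) + ty)

def loss_and_grad(params, source, target):
    """Mean squared alignment error and its gradient w.r.t. (scale, theta, tx, ty)."""
    scale, theta, tx, ty = params
    c, s = math.cos(theta), math.sin(theta)
    n = len(source)
    loss = 0.0
    grad = [0.0, 0.0, 0.0, 0.0]
    for (x, y), (qx, qy) in zip(source, target):
        rot_x, rot_y = c * x - s * y, s * x + c * y  # rotated, unscaled point
        rx = scale * rot_x + tx - qx                 # residual in x
        ry = scale * rot_y + ty - qy                 # residual in y
        loss += (rx * rx + ry * ry) / n
        grad[0] += 2.0 * (rx * rot_x + ry * rot_y) / n
        grad[1] += 2.0 * scale * (-rx * rot_y + ry * rot_x) / n
        grad[2] += 2.0 * rx / n
        grad[3] += 2.0 * ry / n
    return loss, grad

def fit_similarity(source, target, lr=0.1, steps=500):
    """Plain gradient descent starting from the identity transform."""
    params = [1.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        _, grad = loss_and_grad(params, source, target)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params

# Toy data (hypothetical): corners of a square and their transformed copies.
source = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
true_params = (1.2, 0.3, 0.5, -0.2)
target = [apply_similarity(true_params, p) for p in source]

estimate = fit_similarity(source, target)
final_loss, _ = loss_and_grad(estimate, source, target)
```

On this toy problem `estimate` converges to `true_params`. Starting from the identity works here because the true transform is close to it, which loosely mirrors the coarse-then-fine order shown in Figure 3, where ICP precedes the fine-grained optimization.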