Table of Contents
Fetching ...

Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling

Mingtong Zhang, Kaifeng Zhang, Yunzhu Li

TL;DR

This work introduces a framework to learn object dynamics directly from multi-view RGB videos by explicitly considering the robot's action trajectories and their effects on scene dynamics, using the 3D Gaussian representation of 3D Gaussian Splatting to train a particle-based dynamics model using Graph Neural Networks.

Abstract

Videos of robots interacting with objects encode rich information about the objects' dynamics. However, existing video prediction approaches typically do not explicitly account for the 3D information from videos, such as robot actions and objects' 3D states, limiting their use in real-world robotic applications. In this work, we introduce a framework to learn object dynamics directly from multi-view RGB videos by explicitly considering the robot's action trajectories and their effects on scene dynamics. We utilize the 3D Gaussian representation of 3D Gaussian Splatting (3DGS) to train a particle-based dynamics model using Graph Neural Networks. This model operates on sparse control particles downsampled from the densely tracked 3D Gaussian reconstructions. By learning the neural dynamics model on offline robot interaction data, our method can predict object motions under varying initial configurations and unseen robot actions. The 3D transformations of Gaussians can be interpolated from the motions of control particles, enabling the rendering of predicted future object states and achieving action-conditioned video prediction. The dynamics model can also be applied to model-based planning frameworks for object manipulation tasks. We conduct experiments on various kinds of deformable materials, including ropes, clothes, and stuffed animals, demonstrating our framework's ability to model complex shapes and dynamics. Our project page is available at https://gs-dynamics.github.io.

Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling

TL;DR

This work introduces a framework to learn object dynamics directly from multi-view RGB videos by explicitly considering the robot's action trajectories and their effects on scene dynamics, using the 3D Gaussian representation of 3D Gaussian Splatting to train a particle-based dynamics model using Graph Neural Networks.

Abstract

Videos of robots interacting with objects encode rich information about the objects' dynamics. However, existing video prediction approaches typically do not explicitly account for the 3D information from videos, such as robot actions and objects' 3D states, limiting their use in real-world robotic applications. In this work, we introduce a framework to learn object dynamics directly from multi-view RGB videos by explicitly considering the robot's action trajectories and their effects on scene dynamics. We utilize the 3D Gaussian representation of 3D Gaussian Splatting (3DGS) to train a particle-based dynamics model using Graph Neural Networks. This model operates on sparse control particles downsampled from the densely tracked 3D Gaussian reconstructions. By learning the neural dynamics model on offline robot interaction data, our method can predict object motions under varying initial configurations and unseen robot actions. The 3D transformations of Gaussians can be interpolated from the motions of control particles, enabling the rendering of predicted future object states and achieving action-conditioned video prediction. The dynamics model can also be applied to model-based planning frameworks for object manipulation tasks. We conduct experiments on various kinds of deformable materials, including ropes, clothes, and stuffed animals, demonstrating our framework's ability to model complex shapes and dynamics. Our project page is available at https://gs-dynamics.github.io.

Paper Structure

This paper contains 21 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: We propose a novel approach for learning a neural dynamics model from real-world data. Using videos captured from robot-object interactions, we obtain dense 3D tracking with a dynamic 3D Gaussian Splatting framework. We train a graph-based neural dynamics model on top of the 3D Gaussian particles for action-conditioned video prediction and model-based planning.
  • Figure 2: Overview of Our Framework: We first achieve dense 3D tracking of long-horizon robot-object interactions using multi-view videos and Dyn3DGS optimization. We then learn the object dynamics through a graph-based neural network. This approach enables applications such as (i) action-conditioned video prediction using linear blend skinning for motion prediction, and (ii) model-based planning for robotics.
  • Figure 3: Qualitative Results of 3D Gaussian Tracking. We demonstrate point-level correspondence on the objects across various timesteps. Please check our https://gaussian-gbnd.github.io/ for more videos showcasing precise dense tracking even under different object deformations and occlusions.
  • Figure 4: Qualitative Results of Action-Conditioned 3D Video Prediction. Our videos are generated by rendering predicted Gaussians on virtual backgrounds. Robot trajectories are visualized as curved lines (yellow: current end-effector positions, purple: history end-effector positions). Compared to the MPM baseline, our video prediction results align with the ground truth frames (GT) more accurately.
  • Figure 5: Quantitative Results of model-based planning. We perform each experiment 5 times and present the results as follows: (i) the median error curve relative to planning steps, with the area between 25 and 75 percentiles shaded, and (ii) the success rate curve relative to error thresholds.