Particle-Grid Neural Dynamics for Learning Deformable Object Models from RGB-D Videos

Kaifeng Zhang, Baoyu Li, Kris Hauser, Yunzhu Li

TL;DR

Particle-Grid Neural Dynamics addresses deformable object modeling from RGB-D videos by fusing a set of discrete particles with a fixed 3D grid to learn dense motion under robot interactions. The approach employs a neural dynamics function that combines a global feature encoder, a neural velocity field, a grid-based velocity editor, and a grid-to-particle integrator to predict future motion, even with partial observations. It further integrates with 3D Gaussian Splatting to render action-conditioned video predictions and enables model-based planning via MPC. Across ropes, cloth, plush toys, bags, boxes, and bread, the method outperforms physics-based and graph-based baselines in dynamics prediction and planning, and generalizes to unseen instances while enabling high-fidelity 3D rendering. The work advances data-driven, real-world capable digital twins for deformable objects in robotic manipulation.

Abstract

Modeling the dynamics of deformable objects is challenging due to their diverse physical properties and the difficulty of estimating states from limited visual information. We address these challenges with a neural dynamics framework that combines object particles and spatial grids in a hybrid representation. Our particle-grid model captures global shape and motion information while predicting dense particle movements, enabling the modeling of objects with varied shapes and materials. Particles represent object shapes, while the spatial grid discretizes the 3D space to ensure spatial continuity and enhance learning efficiency. Coupled with Gaussian Splatting for visual rendering, our framework achieves a fully learning-based digital twin of deformable objects and generates 3D action-conditioned videos. Through experiments, we demonstrate that our model learns the dynamics of diverse objects -- such as ropes, cloths, stuffed animals, and paper bags -- from sparse-view RGB-D recordings of robot-object interactions, while also generalizing at the category level to unseen instances. Our approach outperforms state-of-the-art learning-based and physics-based simulators, particularly in scenarios with limited camera views. Furthermore, we showcase the utility of our learned models in model-based planning, enabling goal-conditioned object manipulation across a range of tasks. The project page is available at https://kywind.github.io/pgnd.

Paper Structure

This paper contains 55 sections, 23 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Modeling deformable objects from RGB-D videos presents a significant challenge due to occlusions and complex physical interactions. Our Particle-Grid Neural Dynamics framework learns the behavior of deformable objects directly from real-world observations. To train the model, we introduce a novel dense 3D tracking method that leverages foundational vision models for video tracking. The trained model predicts the motion of dense particles under robot-object interactions. We demonstrate the ability of Particle-Grid Neural Dynamics to model complex interactions across a diverse set of objects, including ropes, cloth, plush toys, boxes, and bread.
  • Figure 2: Overview of the proposed framework: Particle-Grid Neural Dynamics. (a) A diagram of our dynamics model. Given particle positions $\mathbf{X}_t$ and velocities $\mathbf{V}_t$ fused from multi-view depth images as input, our model predicts dense per-particle motion by first using a point encoder to extract particle features and predict the velocity field, which is then transformed into a grid representation to estimate the velocity distribution in 3D space. The model updates particle positions $\hat{\mathbf{X}}_{t+\Delta t}$ with the predicted velocities $\hat{\mathbf{V}}_{t+\Delta t}$ to perform iterative rollouts. (b) Our framework enables 3D action-conditioned video prediction by reconstructing objects with 3D Gaussian Splatting and interpolating the 6DoF transformation of Gaussian kernels using the predicted particle motions. (c) The model can be integrated into model-based planning frameworks to generate plausible motions for manipulating deformable objects.
  • Figure 3: Qualitative Comparisons on Dynamics Prediction. Given initial states and actions, we show the prediction results of the GBND baseline compared to our particle-grid neural dynamics model. The red spheres indicate the position and orientation of robot grippers. We overlay the predictions with ground truth final state images to highlight the prediction errors. Our model's predictions are more aligned with the ground truth, offering higher-density particle predictions and fewer artifacts compared to the baseline.
  • Figure 4: Quantitative Comparisons on Prediction under Partial Views. We compare our method with the GBND baseline in the cloth and paper bag categories while varying the number of input camera views. We report the mean and standard deviation of the dynamics prediction error. Our method consistently achieves lower error than the baseline, and its error grows more slowly as the number of camera views decreases.
  • Figure 5: Quantitative Comparisons on Generalization. Our method is compared with GBND on seen and unseen instances of the rope and cloth categories. We present the mean and standard deviation of dynamics prediction error. Our method's prediction error is lower on both seen and unseen instances compared to the baseline.
  • ...and 9 more figures
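
The particle-grid update described in the Figure 2(a) caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a cubic grid with trilinear interpolation weights, a generic weighted-average scatter from particles to grid nodes, and a forward-Euler position update. The names `trilinear_weights`, `particles_to_grid`, `grid_to_particles`, `rollout`, and the `predict_velocity` callback (a stand-in for the learned network) are all hypothetical.

```python
import numpy as np

# All 8 corner offsets of a grid cell, shape (8, 3).
OFFSETS = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)])


def trilinear_weights(x, grid_min, cell, res):
    """Corner indices (N, 8, 3) and trilinear weights (N, 8) for points x (N, 3)."""
    u = (x - grid_min) / cell                  # continuous grid coordinates
    u = np.clip(u, 0.0, res - 1 - 1e-6)        # keep every point inside the grid
    base = np.floor(u).astype(int)             # lower corner of the enclosing cell
    frac = u - base                            # offset inside the cell, in [0, 1)
    idx = base[:, None, :] + OFFSETS[None]     # (N, 8, 3)
    w = np.where(OFFSETS[None] == 1, frac[:, None, :], 1.0 - frac[:, None, :])
    return idx, w.prod(axis=-1)                # weights sum to 1 per particle


def particles_to_grid(x, v, grid_min, cell, res):
    """Scatter particle velocities to grid nodes as a weighted average."""
    idx, w = trilinear_weights(x, grid_min, cell, res)
    flat = idx.reshape(-1, 3)
    node = (flat[:, 0], flat[:, 1], flat[:, 2])
    grid_v = np.zeros((res, res, res, 3))
    grid_w = np.zeros((res, res, res))
    np.add.at(grid_v, node, (w[..., None] * v[:, None, :]).reshape(-1, 3))
    np.add.at(grid_w, node, w.reshape(-1))
    occupied = grid_w > 0
    grid_v[occupied] /= grid_w[occupied][:, None]  # normalize occupied nodes only
    return grid_v


def grid_to_particles(x, grid_v, grid_min, cell, res):
    """Gather node velocities back to particles by trilinear interpolation."""
    idx, w = trilinear_weights(x, grid_min, cell, res)
    corner_v = grid_v[idx[..., 0], idx[..., 1], idx[..., 2]]  # (N, 8, 3)
    return (w[..., None] * corner_v).sum(axis=1)


def rollout(x, v, predict_velocity, steps, dt, grid_min, cell, res):
    """Iterative rollout: predict, pass through the grid, then Euler-update."""
    traj = [x]
    for _ in range(steps):
        v_pred = predict_velocity(x, v)        # placeholder for the learned model
        grid_v = particles_to_grid(x, v_pred, grid_min, cell, res)
        v = grid_to_particles(x, grid_v, grid_min, cell, res)
        x = x + dt * v                         # assumed forward-Euler update
        traj.append(x)
    return np.stack(traj)                      # (steps + 1, N, 3)
```

Routing velocities through a shared grid makes nearby particles receive smoothly varying updates, which is one plausible reading of the "spatial continuity" the abstract attributes to the grid. In the paper the velocity field is predicted by a neural network. Here any callable with the signature `predict_velocity(x, v)` can be plugged in, for example `lambda x, v: v` to carry the current velocities forward unchanged.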