Particle-Grid Neural Dynamics for Learning Deformable Object Models from RGB-D Videos
Kaifeng Zhang, Baoyu Li, Kris Hauser, Yunzhu Li
TL;DR
Particle-Grid Neural Dynamics models deformable objects from RGB-D videos by coupling a set of discrete object particles with a fixed 3D grid to learn dense motion under robot interaction. Its neural dynamics function combines a global feature encoder, a neural velocity field, a grid-based velocity editor, and a grid-to-particle integrator to predict future motion, even from partial observations. The model integrates with 3D Gaussian Splatting to render action-conditioned video predictions and supports model-based planning via model predictive control (MPC). Across ropes, cloth, plush toys, bags, boxes, and bread, the method outperforms physics-based and graph-based baselines in dynamics prediction and planning, generalizes to unseen instances, and supports high-fidelity 3D rendering. The work advances data-driven digital twins of deformable objects that hold up in real-world robotic manipulation.
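To make the grid-to-particle step concrete, here is a minimal sketch of one plausible form it could take: velocities stored at grid nodes are trilinearly interpolated at each particle position, and particles are then advanced with an explicit Euler step. The summary above only names a "grid-to-particle integrator", so the interpolation scheme, the Euler update, and every name here (`grid_to_particle`, `step_particles`, `origin`, `cell_size`, `dt`) are illustrative assumptions rather than the authors' implementation.

```python
import math
from itertools import product


def grid_to_particle(position, grid_vel, origin, cell_size):
    """Trilinearly interpolate node velocities at one particle position.

    grid_vel[i][j][k] is the (vx, vy, vz) velocity stored at grid node (i, j, k).
    Assumes (hypothetically) a regular grid with at least 2 nodes per axis.
    """
    res = (len(grid_vel), len(grid_vel[0]), len(grid_vel[0][0]))
    base, frac = [], []
    for axis in range(3):
        # Continuous grid coordinate, clamped so that base + 1 stays in range.
        c = (position[axis] - origin[axis]) / cell_size
        c = min(max(c, 0.0), res[axis] - 1 - 1e-9)
        b = math.floor(c)
        base.append(b)
        frac.append(c - b)

    vel = [0.0, 0.0, 0.0]
    for offset in product((0, 1), repeat=3):
        # Weight of this cell corner: product of the per-axis linear weights.
        w = 1.0
        for axis in range(3):
            w *= frac[axis] if offset[axis] else 1.0 - frac[axis]
        node = grid_vel[base[0] + offset[0]][base[1] + offset[1]][base[2] + offset[2]]
        for axis in range(3):
            vel[axis] += w * node[axis]
    return vel


def step_particles(positions, grid_vel, origin, cell_size, dt):
    """Advance every particle by one explicit Euler step (an assumed integrator)."""
    new_positions = []
    for p in positions:
        v = grid_to_particle(p, grid_vel, origin, cell_size)
        new_positions.append([p[a] + dt * v[a] for a in range(3)])
    return new_positions
```

Because the eight corner weights sum to one, a spatially uniform grid velocity is reproduced exactly at every particle, which is one way a grid can impose spatial continuity on dense particle motion.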
Abstract
Modeling the dynamics of deformable objects is challenging due to their diverse physical properties and the difficulty of estimating states from limited visual information. We address these challenges with a neural dynamics framework that combines object particles and spatial grids in a hybrid representation. Our particle-grid model captures global shape and motion information while predicting dense particle movements, enabling the modeling of objects with varied shapes and materials. Particles represent object shapes, while the spatial grid discretizes the 3D space to ensure spatial continuity and enhance learning efficiency. Coupled with Gaussian Splatting for visual rendering, our framework achieves a fully learning-based digital twin of deformable objects and generates 3D action-conditioned videos. Through experiments, we demonstrate that our model learns the dynamics of diverse objects -- such as ropes, cloths, stuffed animals, and paper bags -- from sparse-view RGB-D recordings of robot-object interactions, while also generalizing at the category level to unseen instances. Our approach outperforms state-of-the-art learning-based and physics-based simulators, particularly in scenarios with limited camera views. Furthermore, we showcase the utility of our learned models in model-based planning, enabling goal-conditioned object manipulation across a range of tasks. The project page is available at https://kywind.github.io/pgnd.
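As a rough illustration of how a learned dynamics model can drive goal-conditioned, model-based planning, the sketch below runs a receding-horizon loop with random-shooting action sampling. The sampling strategy, the squared-distance cost, and all names (`plan_random_shooting`, `run_mpc`, `toy_dynamics`) are assumptions made for this example. In particular, `toy_dynamics` is a trivial stand-in for a learned model, not the particle-grid network described above.

```python
import random


def rollout_cost(dynamics, state, actions, goal):
    """Roll the model through `actions`; return final squared distance to `goal`."""
    for a in actions:
        state = dynamics(state, a)
    return sum((s - g) ** 2 for s, g in zip(state, goal))


def plan_random_shooting(dynamics, state, goal, horizon=5, num_samples=256,
                         max_step=0.1, rng=None):
    """Sample random action sequences and keep the one with the lowest cost."""
    rng = rng or random.Random(0)
    best_cost, best_actions = float("inf"), None
    for _ in range(num_samples):
        actions = [[rng.uniform(-max_step, max_step) for _ in state]
                   for _ in range(horizon)]
        cost = rollout_cost(dynamics, state, actions, goal)
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost


def toy_dynamics(state, action):
    """Hypothetical stand-in for a learned model: the state moves by the action."""
    return [s + a for s, a in zip(state, action)]


def run_mpc(dynamics, state, goal, steps=20, seed=0):
    """Receding-horizon loop: plan, execute only the first action, then replan."""
    rng = random.Random(seed)
    for _ in range(steps):
        actions, _ = plan_random_shooting(dynamics, state, goal, rng=rng)
        state = dynamics(state, actions[0])
    return state
```

A call such as `run_mpc(toy_dynamics, [0.0, 0.0], [0.5, -0.3])` steers the toy state toward the goal. Replacing `toy_dynamics` with a learned predictor, and the cost with a task-specific distance between predicted and target object states, would follow the same pattern.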
