Modeling the Real World with High-Density Visual Particle Dynamics
William F. Whitney, Jacob Varley, Deepali Jain, Krzysztof Choromanski, Sumeet Singh, Vikas Sindhwani
TL;DR
High-Density Visual Particle Dynamics (HD-VPD) is a learned world model that predicts real-world robot dynamics by representing scenes as 3D particles and processing them with a novel Interlacer transformer, which interleaves linear-attention layers with local neighborhood attention layers. Trained on multi-view RGB-D data from bi-manual Kuka robots, it scales to 100K+ particles. Compared with the prior GNN-based approach, it is twice as fast at equal prediction quality and reaches higher quality with 4x as many particles, while enabling action-conditioned video generation and planning. The work shows HD-VPD’s applicability to downstream tasks such as box pushing and can grasping, with rollout-based costs aligning with real outcomes. Together, these contributions push toward scalable, high-fidelity, perception-driven world models for planning and control in complex robotic settings.
Abstract
We present High-Density Visual Particle Dynamics (HD-VPD), a learned world model that can emulate the physical dynamics of real scenes by processing massive latent point clouds containing 100K+ particles. To enable efficiency at this scale, we introduce a novel family of Point Cloud Transformers (PCTs) called Interlacers leveraging intertwined linear-attention Performer layers and graph-based neighbour attention layers. We demonstrate the capabilities of HD-VPD by modeling the dynamics of high degree-of-freedom bi-manual robots with two RGB-D cameras. Compared to the previous graph neural network approach, our Interlacer dynamics is twice as fast with the same prediction quality, and can achieve higher quality using 4x as many particles. We illustrate how HD-VPD can evaluate motion plan quality with robotic box pushing and can grasping tasks. See videos and particle dynamics rendered by HD-VPD at https://sites.google.com/view/hd-vpd.
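To make the interleaving idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It alternates a global linear-attention layer (cost linear in the number of particles) with a local attention layer restricted to each particle's spatial neighbours. Several things here are assumptions made for illustration only: the feature map `elu(x) + 1` stands in for the random-feature approximation used by Performers, the neighbourhood graph is built by brute-force k-nearest-neighbours on 3D positions, and the names `linear_attention`, `knn_indices`, `neighbor_attention`, and `interlacer_sketch` are hypothetical. Layer normalisation, MLP blocks, multi-head splitting, and action conditioning are all omitted.

```python
import numpy as np


def linear_attention(q, k, v, eps=1e-6):
    """Global attention in O(N) time via a positive feature map.

    Assumption: uses elu(x) + 1 as the feature map; Performer layers
    use random features instead.
    """
    def phi(x):
        # elu(x) + 1: x + 1 for x > 0, exp(x) otherwise (always positive)
        return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

    qf, kf = phi(q), phi(k)              # (N, d)
    kv = kf.T @ v                        # (d, d_v), summed over all particles
    z = qf @ kf.sum(axis=0)              # (N,) per-query normaliser
    return (qf @ kv) / (z[:, None] + eps)


def knn_indices(pos, k):
    """Brute-force k nearest neighbours (O(N^2); for illustration only)."""
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1)[:, :k]  # each row includes the point itself


def neighbor_attention(q, k, v, nbrs):
    """Softmax attention restricted to each particle's neighbour list."""
    kn, vn = k[nbrs], v[nbrs]            # (N, K, d)
    logits = np.einsum("nd,nkd->nk", q, kn) / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)
    return np.einsum("nk,nkd->nd", w, vn)


def interlacer_sketch(pos, feats, params, k=8):
    """Alternate global (linear) and local (neighbour) attention layers.

    params: list of (Wq, Wk, Wv) projection matrices, one triple per layer.
    """
    nbrs = knn_indices(pos, k)
    x = feats
    for i, (wq, wk, wv) in enumerate(params):
        q, kk, v = x @ wq, x @ wk, x @ wv
        if i % 2 == 0:
            x = x + linear_attention(q, kk, v)          # global mixing
        else:
            x = x + neighbor_attention(q, kk, v, nbrs)  # local mixing
    return x
```

In this sketch the linear-attention layers let every particle exchange information with every other particle without forming an N x N attention matrix, while the neighbour layers capture fine local interactions. A version meant for 100K+ particles would also need to replace the brute-force neighbour search with a spatial index.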
