Modeling the Real World with High-Density Visual Particle Dynamics

William F. Whitney, Jacob Varley, Deepali Jain, Krzysztof Choromanski, Sumeet Singh, Vikas Sindhwani

TL;DR

High-Density Visual Particle Dynamics (HD-VPD) is a world model that learns to predict real-world robotic dynamics by representing scenes as 3D particles and processing them with a novel Interlacer transformer, which interleaves linear attention with local neighborhood attention. Trained on RGB-D data from two cameras observing bi-manual Kuka robots, it scales to 100K+ particles. Compared with the prior GNN-based approach, it is twice as fast at equal prediction quality and reaches higher quality with 4x as many particles, while enabling action-conditioned video generation and planning. The work applies HD-VPD to downstream tasks such as box pushing and can grasping, where rollout-based costs align with real outcomes. Together, these contributions push toward scalable, high-fidelity, perception-driven world models for planning and control in complex robotic settings.

Abstract

We present High-Density Visual Particle Dynamics (HD-VPD), a learned world model that can emulate the physical dynamics of real scenes by processing massive latent point clouds containing 100K+ particles. To enable efficiency at this scale, we introduce a novel family of Point Cloud Transformers (PCTs) called Interlacers, which interleave linear-attention Performer layers with graph-based neighbor-attention layers. We demonstrate the capabilities of HD-VPD by modeling the dynamics of high degree-of-freedom bi-manual robots with two RGB-D cameras. Compared to the previous graph neural network approach, our Interlacer dynamics model is twice as fast at the same prediction quality, and achieves higher quality when using 4x as many particles. We illustrate how HD-VPD can evaluate motion plan quality on robotic box-pushing and can-grasping tasks. See videos and particle dynamics rendered by HD-VPD at https://sites.google.com/view/hd-vpd.
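
To illustrate why linear-attention layers help at the 100K+ particle scale, here is a minimal sketch of Performer-style attention with positive random features. It is not the paper's implementation: the function names, the pure-Python lists, and the omission of learned projections and of the usual 1/sqrt(d) scaling are all assumptions made for brevity. The point it shows is that the key-value summary is built once, so the cost grows linearly with the number of particles instead of quadratically.

```python
import math
import random


def positive_random_features(x, omegas):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m), which is always positive.

    For w_i drawn from a standard Gaussian, phi(q) . phi(k) is an unbiased
    estimate of exp(q . k), the unnormalized softmax kernel.
    """
    half_sq_norm = 0.5 * sum(v * v for v in x)
    m = len(omegas)
    return [
        math.exp(sum(w * v for w, v in zip(omega, x)) - half_sq_norm) / math.sqrt(m)
        for omega in omegas
    ]


def linear_attention(queries, keys, values, omegas):
    """Attention in O(N * m * d) time: no N x N weight matrix is ever formed."""
    q_feats = [positive_random_features(q, omegas) for q in queries]
    k_feats = [positive_random_features(k, omegas) for k in keys]
    m, d_v = len(omegas), len(values[0])

    # Single pass over keys: kv[i][c] = sum_j phi(k_j)[i] * v_j[c]
    kv = [[0.0] * d_v for _ in range(m)]
    k_sum = [0.0] * m
    for kf, v in zip(k_feats, values):
        for i in range(m):
            k_sum[i] += kf[i]
            for c in range(d_v):
                kv[i][c] += kf[i] * v[c]

    # Single pass over queries, reusing the shared summaries.
    out = []
    for qf in q_feats:
        norm = sum(qf[i] * k_sum[i] for i in range(m))
        out.append([sum(qf[i] * kv[i][c] for i in range(m)) / norm for c in range(d_v)])
    return out


def quadratic_reference(queries, keys, values, omegas):
    """Same kernel, but with all N x N weights formed explicitly (for checking)."""
    q_feats = [positive_random_features(q, omegas) for q in queries]
    k_feats = [positive_random_features(k, omegas) for k in keys]
    d_v = len(values[0])
    out = []
    for qf in q_feats:
        weights = [sum(a * b for a, b in zip(qf, kf)) for kf in k_feats]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, values)) / total for c in range(d_v)])
    return out
```

Both functions compute the same quantity; only the order of the multiplications differs. Because `linear_attention` attends globally at low cost but has no notion of spatial locality, pairing it with neighborhood attention, as the Interlacer does, is a natural complement.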

Paper Structure (19 sections, 2 equations, 17 figures)

Figures (17)

  • Figure 1: HD-VPD can accurately predict the dynamics of complex real-world interactions between robots (here, 16-DoF bi-manual Kukas) and objects/tools. Left: a push-pedal trash-can opening task and Right: a bimanual dustpan sweeping task. Top row: Renders from HD-VPD. The first image in each pair is a reconstruction of the matching input frame, and the second is a prediction several timesteps into the future given a sequence of robot actions. Bottom row: Ground-truth test set video frames.
  • Figure 2: Overview of the HD-VPD model with learned Encoders, Dynamics, and Renderer. The Encoders map RGB-D images to a point cloud representation with latent per-point features. The Dynamics model predicts the evolution of the scene conditioned on the current scene as well as a kinematic skeleton representing the motion of the robot. The Renderer is a Point-NeRF style model which enables generation of images of the predicted future scene. The entire model is trained end-to-end with a pixel-wise $L^2$ loss.
  • Figure 3: Interlacer dynamics. The input point clouds from each timestep are processed by the neighbor-attender layers, followed by the Performer layers (see \ref{sec:performer-pct} for details). In the HD-VPD model, a separate third channel is reserved for processing kinematic particles describing the actions conducted by the robot, which are preprocessed by a regular PCT layer. Then, all of the preprocessed point clouds are merged. The model predicts particles' displacements as well as deltas of their corresponding feature vectors, after applying one more neighbor-attender and Performer. See \ref{fig:first_fig} for how Interlacer is integrated with the HD-VPD model.
  • Figure 4: The anatomy of the Neighbor-Attender layer. A subset of anchor particles is sampled uniformly from the full set of particles, and these anchor particles aggregate information from their $k$ nearest neighbors. Then every particle in the full set is updated using the features of its closest anchor particle.
  • Figure 5: Analysis of the models' behavior as a function of the number of particles. Note that the GNN cannot be run with 65K or 131K particles due to memory limitations. Interlacer with 131K particles provides the best prediction quality while staying competitive with the GNN baselines for dynamics speed, and with Performer-PCTs for memory requirements. (a) Test set SSIM prediction quality increases with the number of particles, and Interlacer with 131K points performs best. (b) Interlacer is faster than the GNN while handling many more particles. Performer-PCT is faster still, but achieves worse results. (c) Performer-PCT and Interlacer use less memory than the GNN baseline, enabling them to scale to larger point clouds.
  • ...and 12 more figures
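
The three steps in the Figure 4 caption (sample anchors, aggregate from the $k$ nearest neighbors, broadcast back to every particle) can be sketched as follows. This is an illustrative reading of the caption only, not the authors' code: the function name `neighbor_attender`, the brute-force neighbor search, the parameter-free dot-product attention, and the residual update are all assumptions, and a real layer would use learned projections and a spatial index.

```python
import math
import random


def neighbor_attender(positions, features, num_anchors, k, seed=0):
    """Hypothetical sketch of a neighbor-attender step on a small point cloud.

    positions: list of (x, y, z) tuples; features: list of equal-length lists.
    Returns the updated features and the indices of the sampled anchors.
    """
    rng = random.Random(seed)
    n = len(positions)
    dim = len(features[0])

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Step 1: sample anchors uniformly, without replacement.
    anchor_ids = rng.sample(range(n), num_anchors)

    # Step 2: each anchor attends over its k nearest particles
    # (the anchor itself is included, at distance zero).
    anchor_feats = []
    for a in anchor_ids:
        nbrs = sorted(range(n), key=lambda j: sq_dist(positions[a], positions[j]))[:k]
        scores = [
            sum(q * f for q, f in zip(features[a], features[j])) / math.sqrt(dim)
            for j in nbrs
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        anchor_feats.append(
            [sum(w * features[j][d] for w, j in zip(weights, nbrs)) / total for d in range(dim)]
        )

    # Step 3: every particle is updated from its closest anchor.
    # Adding the anchor feature residually is an assumption of this sketch.
    updated = []
    for i in range(n):
        nearest = min(
            range(num_anchors),
            key=lambda t: sq_dist(positions[i], positions[anchor_ids[t]]),
        )
        updated.append([f + g for f, g in zip(features[i], anchor_feats[nearest])])
    return updated, anchor_ids
```

The design point is that only the anchors run the neighbor search and attention, so that cost scales with the number of anchors rather than with the full particle count, while the broadcast step still gives every particle a locally informed update.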