Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling

Tal Daniel, Carl Qi, Dan Haramati, Amir Zadeh, Chuan Li, Aviv Tamar, Deepak Pathak, David Held

TL;DR

The Latent Particle World Model (LPWM) is a self-supervised object-centric world model that scales to real-world multi-object datasets and is readily applicable to decision-making, including goal-conditioned imitation learning.

Abstract

We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model scaled to real-world multi-object datasets and applicable in decision-making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, enabling it to learn rich scene decompositions without supervision. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particle dynamics via a novel latent action module and achieves state-of-the-art results on diverse real-world and synthetic datasets. Beyond stochastic video modeling, LPWM is readily applicable to decision-making, including goal-conditioned imitation learning, as we demonstrate in the paper. Code, data, pre-trained models and video rollouts are available: https://taldatech.github.io/lpwm-web

Paper Structure

This paper contains 41 sections, 36 equations, 19 figures, 11 tables.

Figures (19)

  • Figure 1: Self-supervised object-centric world modeling with LPWM. Top: latent particle decomposition. Bottom left: language-conditioned video generation. Bottom right: latent-action-conditioned video prediction.
  • Figure 2: Representation discrepancy. Text is typically tokenized into semantically meaningful units such as words or subwords, whereas image representations are most often constructed by dividing the image into a fixed grid of patches (“patchifying”) that do not explicitly encode semantic content.
  • Figure 3: Latent Particle World Model architecture. Left: Input frames are encoded into particle sets by the Encoder and decoded back to images by the Decoder. The Context module then processes the particles to sample latent actions, which are combined with the particles in the Dynamics module to predict next-step particle states. Right: The Context module models the per-particle latent action distribution. During training, we use the latent inverse dynamics head, while at inference, the latent policy is employed for sampling.
  • Figure 4: LPWM-generated goal-conditioned imagined trajectories (top) and actual environment executions (bottom) through a learned mapping to actions on OGBench-Scene. The imagined trajectories closely match the actual executions, demonstrating LPWM's predictive accuracy.
  • Figure 5: Spatial-softmax. Given a heatmap $\tilde{\mathcal{H}} \in \mathbb{R}^{H \times W}$, the softmax function is applied over the spatial dimensions to normalize $\tilde{\mathcal{H}}$ into a probability distribution $\mathcal{H}$. These values are then used to compute the expected coordinate values for each axis, and their covariance.
  • ...and 14 more figures
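
The spatial-softmax operation described in the Figure 5 caption can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function name `spatial_softmax_stats`, the nested-list heatmap format, and the choice to normalize pixel coordinates to $[-1, 1]$ are all assumptions made for illustration. It normalizes a heatmap $\tilde{\mathcal{H}}$ into a distribution $\mathcal{H}$, then computes the expected coordinate on each axis and the 2×2 covariance.

```python
import math


def spatial_softmax_stats(heatmap):
    """Illustrative spatial-softmax (hypothetical helper, not the paper's code).

    heatmap: list of H rows, each a list of W unnormalized scores.
    Returns ((mean_x, mean_y), 2x2 covariance, normalized probabilities).
    """
    h, w = len(heatmap), len(heatmap[0])

    # Softmax over all H*W entries; subtract the max for numerical stability.
    peak = max(max(row) for row in heatmap)
    exps = [[math.exp(v - peak) for v in row] for row in heatmap]
    total = sum(sum(row) for row in exps)
    probs = [[v / total for v in row] for row in exps]

    # Assumption: coordinates are normalized to [-1, 1] along each axis.
    xs = [(-1.0 + 2.0 * j / (w - 1)) if w > 1 else 0.0 for j in range(w)]
    ys = [(-1.0 + 2.0 * i / (h - 1)) if h > 1 else 0.0 for i in range(h)]

    # Expected coordinate per axis under the normalized heatmap.
    mean_x = sum(probs[i][j] * xs[j] for i in range(h) for j in range(w))
    mean_y = sum(probs[i][j] * ys[i] for i in range(h) for j in range(w))

    # Second central moments give the covariance of the coordinates.
    var_x = sum(probs[i][j] * (xs[j] - mean_x) ** 2
                for i in range(h) for j in range(w))
    var_y = sum(probs[i][j] * (ys[i] - mean_y) ** 2
                for i in range(h) for j in range(w))
    cov_xy = sum(probs[i][j] * (xs[j] - mean_x) * (ys[i] - mean_y)
                 for i in range(h) for j in range(w))

    return (mean_x, mean_y), ((var_x, cov_xy), (cov_xy, var_y)), probs


# Usage: a sharp peak in the top-right cell pulls the expected
# coordinate toward x = 1, y = -1 under the assumed normalization.
mean, cov, _ = spatial_softmax_stats([[0.0, 0.0, 50.0],
                                      [0.0, 0.0, 0.0],
                                      [0.0, 0.0, 0.0]])
```

Because the expectation is a weighted sum rather than an argmax, the resulting coordinates vary smoothly with the heatmap values, which is the usual reason this operation is used to extract keypoints in end-to-end trained models.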