
Factored Latent Action World Models

Zizhao Wang, Chang Shi, Jiaheng Hu, Kevin Rohling, Roberto Martín-Martín, Amy Zhang, Peter Stone

TL;DR

The paper addresses learning controllable world models from action-free video in scenes with multiple independently acting entities. It introduces FLAM, a factored latent action model that decomposes the state into $K$ slots and the latent action into $K$ per-slot variables, each of dimension $d/K$ and all drawn from a shared latent action space, enabling scalable modeling of joint dynamics. FLAM uses a pretrained VQ-VAE encoder, a Slot Attention–based factorizer with causal temporal attention, and IDM/FDM modules shared across slots; an Aggregator reconstructs $z_{t+1}$ from the predicted slots. All modules are trained jointly with the objective $\mathcal{L}_{\text{LAM}} = \|z_{t+1}-\hat{z}_{t+1}\|_2^2 + \beta \sum_{i=1}^K D_{\mathrm{KL}}[q(a^i_t) \| p(a^i_t)]$. Empirical results on four simulated datasets and nuPlan show that FLAM yields improved world-model accuracy, clearer factor-entity correspondences, and more sample-efficient policy learning via pseudo-labels. The work demonstrates that factorized latent actions reduce the combinatorial action space and enhance generalization in multi-entity environments.
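
As a rough illustration of how the two terms of $\mathcal{L}_{\text{LAM}}$ combine, the sketch below evaluates the objective for a single transition. It is not the authors' implementation. It assumes, purely for illustration, that each per-slot posterior $q(a^i_t)$ is a diagonal Gaussian and that the shared prior $p(a^i_t)$ is a standard normal; the summary does not specify either distribution. The function names `gaussian_kl_to_standard_normal` and `lam_loss` are hypothetical.

```python
import math


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)] for one slot's latent action.

    Assumption: a diagonal Gaussian posterior and a standard normal prior.
    The source only states that all slots share one prior.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def lam_loss(z_next, z_next_hat, slot_mus, slot_log_vars, beta):
    """Squared L2 prediction error plus beta times the summed per-slot KL."""
    # Reconstruction term: ||z_{t+1} - z_hat_{t+1}||_2^2
    recon = sum((a - b) ** 2 for a, b in zip(z_next, z_next_hat))
    # Regulariser: the same prior is applied to each of the K slots.
    kl = sum(
        gaussian_kl_to_standard_normal(mu, lv)
        for mu, lv in zip(slot_mus, slot_log_vars)
    )
    return recon + beta * kl


# Toy example with K = 2 slots and 2-dimensional per-slot latent actions.
loss = lam_loss(
    z_next=[1.0, 2.0],
    z_next_hat=[1.0, 0.0],
    slot_mus=[[0.0, 0.0], [1.0, 0.0]],
    slot_log_vars=[[0.0, 0.0], [0.0, 0.0]],
    beta=2.0,
)
```

Because the KL term is a sum over slots against one shared prior, adding a slot adds one more term of the same form rather than enlarging a joint latent space.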

Abstract

Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next-step factor value. This factorized structure enables more accurate modeling of complex multi-entity dynamics and improves video generation quality in action-free video settings compared to monolithic models. Based on experiments on both simulation and real-world multi-entity datasets, we find that FLAM outperforms prior work in prediction accuracy and representation quality, and facilitates downstream policy learning, demonstrating the benefits of factorized latent action models.

Paper Structure

This paper contains 33 sections, 12 equations, 14 figures, 8 tables, 1 algorithm.

Figures (14)

  • Figure 1: (left) A multi-entity scenario, such as an intersection with three road users. (middle) A vanilla latent action model encodes the scene change with a single latent action of dimension $d$, which makes learning challenging because this latent action space must model all $|\mathcal{A}|^K$ joint action combinations. (right) In contrast, FLAM decomposes the state into $K$ factors, each with its own latent action of dimension $\frac{d}{K}$. Additionally, we assume all latent actions share the same space (i.e., the same prior / codebook), which reduces the learning problem to modeling the $|\mathcal{A}|$ actions per factor rather than their $|\mathcal{A}|^K$ joint combinations.
  • Figure 2: Latent action model.
  • Figure 3: Two training stages of FLAM. (a) A VQ-VAE is pretrained to extract features for latent action model learning. (b) FLAM infers latent actions and makes predictions for each factor independently, with all modules trained jointly to minimize the prediction error.
  • Figure 4: Prediction performance as the number of entities in the scene increases.
  • Figure 5: UMAP projection of the learned latent actions on the MultiGrid dataset. Each point corresponds to a latent action inferred by the IDM for one factor on a single transition from the observation-only dataset. Points are colored by the ground-truth action taken by the corresponding agent at that transition (action labels are used only for visualization and are not used for training).
  • ...and 9 more figures
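
The counting argument in the Figure 1 caption can be made concrete with a small sketch. The numbers used here (5 actions per entity, 3 entities, a total latent dimension of 12) are illustrative assumptions, not values from the paper, and the helper names are hypothetical.

```python
def joint_action_count(num_actions, num_factors):
    """Joint combinations a single monolithic latent action must cover: |A|^K."""
    return num_actions ** num_factors


def per_factor_latent_dim(total_dim, num_factors):
    """Dimension d/K of each factor's latent action (assumes K divides d)."""
    if total_dim % num_factors != 0:
        raise ValueError("total_dim must be divisible by num_factors")
    return total_dim // num_factors


# Illustrative values only: |A| = 5 actions, K = 3 entities, d = 12.
num_actions, num_factors, total_dim = 5, 3, 12

monolithic = joint_action_count(num_actions, num_factors)  # |A|^K joint actions
factored = num_actions                                     # |A| actions in the shared space
slot_dim = per_factor_latent_dim(total_dim, num_factors)   # d / K

print(f"monolithic: {monolithic} joint actions in one {total_dim}-dim latent")
print(f"factored:   {factored} actions per factor, {num_factors} latents of dim {slot_dim}")
```

With these toy values the monolithic model faces 125 joint combinations, while the factored model only needs to represent 5 actions in its shared space, and the gap widens exponentially as $K$ grows.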