Factored Latent Action World Models
Zizhao Wang, Chang Shi, Jiaheng Hu, Kevin Rohling, Roberto Martín-Martín, Amy Zhang, Peter Stone
TL;DR
The paper addresses learning controllable world models from action-free video in scenes with multiple independently acting entities. It introduces FLAM, a factored latent action model that decomposes the state into $K$ slots and the latent action into per-slot variables, each of dimension $d/K$ within a shared latent action space, enabling scalable modeling of joint dynamics. FLAM uses a pretrained VQ-VAE encoder, a Slot Attention–based factorizer with causal temporal attention, and shared inverse and forward dynamics models (IDM/FDM); an Aggregator reconstructs $z_{t+1}$ from the predicted slots. The model is trained with the objective $\mathcal{L}_{\text{LAM}} = \|z_{t+1}-\hat{z}_{t+1}\|^2_2 + \beta \sum_{i=1}^K D_{KL}[q(a^i_t) \| p(a^i_t)]$. Empirical results on four simulated datasets and nuPlan show that FLAM yields improved world-model accuracy, clearer factor-entity correspondences, and more sample-efficient policy learning via pseudo-labels. The work demonstrates that factorized latent actions reduce the combinatorial action space and enhance generalization in multi-entity environments.
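To make the objective concrete, the following is a minimal NumPy sketch of $\mathcal{L}_{\text{LAM}}$ for a single transition. It is an illustration under stated assumptions, not the authors' implementation: the summary does not specify the form of $q$ or $p$, so the sketch assumes a diagonal Gaussian posterior $q(a^i_t)$ and a standard normal prior $p(a^i_t)$, for which the KL term has a closed form. The function names, array shapes, and the value of $\beta$ are hypothetical.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL[N(mu, diag(exp(log_var))) || N(0, I)] per slot.

    ASSUMPTION: Gaussian posterior and standard normal prior; the source
    only writes D_KL[q(a_t^i) || p(a_t^i)] without naming the distributions.
    mu, log_var: arrays of shape (K, d_per_slot). Returns shape (K,).
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def lam_loss(z_next, z_next_hat, mu, log_var, beta):
    """Squared L2 reconstruction error plus beta times the summed per-slot KL."""
    # ||z_{t+1} - z_hat_{t+1}||_2^2 over the flattened latent
    recon = np.sum((z_next - z_next_hat) ** 2)
    # sum over the K slots of the per-slot KL terms
    kl = np.sum(kl_to_standard_normal(mu, log_var))
    return recon + beta * kl


# Hypothetical toy sizes: latent of size 4, K = 2 slots, 3 action dims per slot.
z_next = np.ones(4)
z_next_hat = np.zeros(4)
mu = np.ones((2, 3))
log_var = np.zeros((2, 3))
loss = lam_loss(z_next, z_next_hat, mu, log_var, beta=0.1)
```

Because the KL term is summed over slots, each slot's latent action is regularized independently, which matches the per-slot factorization described above; with $K = 1$ the expression reduces to the objective of a monolithic latent action model.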
Abstract
Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces the Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next-step factor value. Compared to monolithic models, this factorized structure enables more accurate modeling of complex multi-entity dynamics and improves video generation quality in action-free video settings. In experiments on both simulated and real-world multi-entity datasets, we find that FLAM outperforms prior work in prediction accuracy and representation quality and facilitates downstream policy learning, demonstrating the benefits of factorized latent action models.
