When Object-Centric World Models Meet Policy Learning: From Pixels to Policies, and Where It Breaks
Stefano Ferraro, Akihiro Nakano, Masahiro Suzuki, Yutaka Matsuo
TL;DR
The paper asks whether unsupervised, disentangled object-centric representations can improve policy learning under distribution shift. It introduces DLPWM, a DDLP-based object-centric world model (OCWM) that learns per-object latent particles from pixels, each with components for position, depth, scale, transparency, and appearance, and couples them with a dynamics predictor, a particle aggregator, and a DreamerV3–style actor–critic. DLPWM achieves strong reconstruction and prediction and is robust to several out-of-distribution (OOD) visual variations, but policies trained on its latents underperform DreamerV3 because the representation drifts during multi-object interactions. The authors identify contact-induced perturbations and slot-identity drift as key causes of instability and propose EMA-based slot smoothing as a potential remedy, underscoring that robust visual modeling alone is not sufficient for stable control.
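To make the EMA-based slot smoothing idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the per-object latents are available as an array of shape (T, K, D) (time steps, slots, latent dimensions), that slot k at step t refers to the same object as slot k at step t-1, and that a single coefficient `alpha` is shared across slots. The function name `ema_smooth_slots`, the array layout, and the default `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np


def ema_smooth_slots(latents, alpha=0.9):
    """Exponential moving average over time, applied independently per slot.

    latents: array-like of shape (T, K, D); assumed layout, not from the paper.
    alpha:   weight on the previous smoothed value (hypothetical default).
    """
    latents = np.asarray(latents, dtype=np.float64)
    smoothed = np.empty_like(latents)
    smoothed[0] = latents[0]  # initialise with the first observation
    for t in range(1, latents.shape[0]):
        # Standard EMA recursion: s_t = alpha * s_{t-1} + (1 - alpha) * z_t
        smoothed[t] = alpha * smoothed[t - 1] + (1.0 - alpha) * latents[t]
    return smoothed


# Toy trajectory: 4 steps, 2 slots, 3 latent dims, with a one-step jump at t=2
# standing in for a contact-induced perturbation.
trajectory = np.zeros((4, 2, 3))
trajectory[2] = 1.0
smoothed = ema_smooth_slots(trajectory, alpha=0.9)
```

In this toy case the jump of size 1.0 is damped to 0.1 in the smoothed latents, at the cost of some lag in tracking genuine changes. The sketch only helps if slot identities are stable over time; if slots swap objects, the average would blend latents of different objects, so some form of slot matching would be needed first.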
Abstract
Object-centric world models (OCWMs) aim to decompose visual scenes into object-level representations, providing structured abstractions that could improve compositional generalization and data efficiency in reinforcement learning. We hypothesize that explicitly disentangled object-level representations, by localizing task-relevant information, can enhance policy performance across novel feature combinations. To test this hypothesis, we introduce DLPWM, a fully unsupervised, disentangled object-centric world model that learns object-level latents directly from pixels. DLPWM achieves strong reconstruction and prediction performance, including robustness to several out-of-distribution (OOD) visual variations. However, when used for downstream model-based control, policies trained on DLPWM latents underperform DreamerV3. Through latent-trajectory analyses, we identify representation shift during multi-object interactions as a key driver of unstable policy learning. Our results suggest that, although object-centric perception supports robust visual modeling, achieving stable control requires mitigating latent drift.
