When Object-Centric World Models Meet Policy Learning: From Pixels to Policies, and Where It Breaks

Stefano Ferraro, Akihiro Nakano, Masahiro Suzuki, Yutaka Matsuo

TL;DR

The paper asks whether unsupervised, disentangled object-centric representations can improve policy learning under distribution shifts. It introduces DLPWM, a DDLP-based OCWM that learns per-object latent particles from pixels, with components for position, depth, scale, transparency, and appearance, coupled with a dynamics predictor, a particle aggregator, and a DreamerV3-style actor-critic. DLPWM achieves strong reconstruction and prediction and is robust to several OOD visual variations, but policies trained on its latents underperform DreamerV3, which the authors attribute to representation drift during multi-object interactions. They identify contact-induced perturbations and slot-identity drift as key causes of instability and propose an EMA-based slot smoothing strategy as a potential remedy, underscoring that robust visual modeling alone is not sufficient for stable control.
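To make the EMA-based slot smoothing idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each slot index refers to the same object at every time step and applies an exponential moving average per slot and per latent dimension. The function names `ema_smooth_slots` and `max_step_change` and the decay parameter `alpha` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

# One time step of slot latents: [num_slots][latent_dim] (illustrative layout).
Slots = List[List[float]]


def ema_smooth_slots(trajectory: Sequence[Slots], alpha: float = 0.9) -> List[Slots]:
    """Smooth slot latents over time with an exponential moving average.

    smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * trajectory[t]

    Assumption: slot k tracks the same object at every step, so averaging
    across time never mixes latents of different objects.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    smoothed: List[Slots] = []
    for t, slots in enumerate(trajectory):
        if t == 0:
            # The first frame has no history, so it is copied unchanged.
            smoothed.append([list(slot) for slot in slots])
            continue
        prev = smoothed[-1]
        smoothed.append([
            [alpha * p + (1.0 - alpha) * x for p, x in zip(prev_slot, slot)]
            for prev_slot, slot in zip(prev, slots)
        ])
    return smoothed


def max_step_change(trajectory: Sequence[Slots]) -> float:
    """Largest absolute frame-to-frame change over all slots and dimensions."""
    return max(
        abs(b - a)
        for prev, cur in zip(trajectory, trajectory[1:])
        for prev_slot, cur_slot in zip(prev, cur)
        for a, b in zip(prev_slot, cur_slot)
    )


# Toy trajectory: one slot with a 1-D latent that spikes at a "contact" frame.
raw = [[[0.0]], [[0.0]], [[1.0]], [[0.0]], [[0.0]]]
smooth = ema_smooth_slots(raw, alpha=0.5)
```

In this toy trajectory the smoothed latent has a smaller frame-to-frame jump at the spike than the raw latent. A larger `alpha` suppresses contact-induced jumps more strongly, but the smoothed latent then lags further behind real changes in the scene.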

Abstract

Object-centric world models (OCWM) aim to decompose visual scenes into object-level representations, providing structured abstractions that could improve compositional generalization and data efficiency in reinforcement learning. We hypothesize that explicitly disentangled object-level representations, by localizing task-relevant information, can enhance policy performance across novel feature combinations. To test this hypothesis, we introduce DLPWM, a fully unsupervised, disentangled object-centric world model that learns object-level latents directly from pixels. DLPWM achieves strong reconstruction and prediction performance, including robustness to several out-of-distribution (OOD) visual variations. However, when used for downstream model-based control, policies trained on DLPWM latents underperform compared to DreamerV3. Through latent-trajectory analyses, we identify representation shift during multi-object interactions as a key driver of unstable policy learning. Our results suggest that, although object-centric perception supports robust visual modeling, achieving stable control requires mitigating latent drift.

Paper Structure

This paper contains 17 sections, 3 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Per-episode reward over training steps on the cube lift task. Policies trained with DLPWM are tested with both GNN and transformer (TF) particle aggregators. During the policy training phase, world model and policy updates occur every 10 steps. Two seeds are considered for each run.
  • Figure 2: Latent variation relative to contact frames. The horizontal dashed line represents the frame where contact between the robotic arm and the target object is established. Visualized are the position latent $z_p$, the scale latent $z_s$, and the visual feature latent $z_f$. Results are averaged over 10 evaluation episodes, in which a total of 39 contact points are identified.
  • Figure 3: All the shape and color combinations present in the Generalization Arena task. On the left, the 7 combinations used for training; on the right, the 2 combinations used for evaluation.
  • Figure 4: Architecture of DLPWM.
  • Figure 5: Reconstruction examples for DreamerV3 and DLPWM. Object masks are from DLPWM.
  • ...and 3 more figures