Slot Structured World Models
Jonathan Collu, Riccardo Majellaro, Aske Plaat, Thomas M. Moerland
TL;DR
Slot Structured World Models (SSWM) pair an object-centric Slot Attention encoder with a graph-based latent dynamics model to improve object-centric scene understanding. The approach yields more disentangled object representations and better multi-step prediction than the prior C-SWM baseline, especially in environments that require complex object interactions. Empirical results on the Interactive Spriteworld benchmark show that SSWM achieves higher accuracy at 1-, 5-, and 10-step horizons and produces clearer latent-space structure and pixel-space predictions. These findings suggest that object-centric, slot-based representations paired with iterative graph dynamics can enhance model-based reasoning and planning in environments with multiple interacting objects. Code is available for reproduction.
Abstract
The ability to perceive and reason about individual objects and their interactions is a key goal in building intelligent artificial systems. State-of-the-art approaches use a feedforward encoder to extract object embeddings and a latent graph neural network to model the interactions between these object embeddings. However, the feedforward encoder cannot extract {\it object-centric} representations, nor can it disentangle multiple objects with similar appearance. To solve these issues, we introduce {\it Slot Structured World Models} (SSWM), a class of world models that combines an {\it object-centric} encoder (based on Slot Attention) with a latent graph-based dynamics model. We evaluate our method on the Spriteworld benchmark with simple rules of physical interaction, where Slot Structured World Models consistently outperform baselines on a range of (multi-step) prediction tasks with action-conditional object interactions. All code to reproduce the paper's experiments is available at \url{https://github.com/JonathanCollu/Slot-Structured-World-Models}.
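To make the two components named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a simplified Slot Attention step (the learned GRU/MLP slot update is replaced by a direct overwrite), a fully connected message-passing transition with single linear layers, and one action vector per slot. All function names, shapes, and weight layouts (`slot_attention`, `gnn_transition`, `rollout`, `w_edge`, `w_node`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, slots, wq, wk, wv, iters=3, eps=1e-8):
    """Simplified Slot Attention. inputs: (N, D) features, slots: (K, D).
    Assumption: slots are overwritten by the attention-weighted mean
    instead of being updated through a GRU and MLP."""
    d = wq.shape[1]
    k, v = inputs @ wk, inputs @ wv              # keys and values, (N, D)
    for _ in range(iters):
        q = slots @ wq                           # queries, (K, D)
        logits = k @ q.T / np.sqrt(d)            # (N, K)
        attn = softmax(logits, axis=1)           # normalize over slots: slots compete per input
        weights = attn / (attn.sum(axis=0, keepdims=True) + eps)  # weighted mean over inputs
        slots = weights.T @ v                    # (K, D)
    return slots, attn

def gnn_transition(slots, action, w_edge, w_node):
    """One latent transition step over a fully connected slot graph.
    slots: (K, D); action: (K, A), one action vector per slot (assumed layout)."""
    K, D = slots.shape
    src = np.repeat(slots, K, axis=0)            # sender i for flat index i*K + j
    dst = np.tile(slots, (K, 1))                 # receiver j for flat index i*K + j
    edges = np.tanh(np.concatenate([src, dst], axis=1) @ w_edge).reshape(K, K, D)
    mask = 1.0 - np.eye(K)[:, :, None]           # remove self-edges
    messages = (edges * mask).sum(axis=0)        # sum over senders, (K, D)
    node_in = np.concatenate([slots, action, messages], axis=1)
    return slots + node_in @ w_node              # predict a residual update

def rollout(slots, actions, w_edge, w_node):
    """Multi-step prediction: feed each predicted state back into the model."""
    preds = []
    for a in actions:                            # actions: (T, K, A)
        slots = gnn_transition(slots, a, w_edge, w_node)
        preds.append(slots)
    return np.stack(preds)                       # (T, K, D)

# Toy usage with random, untrained weights (shapes only, no learned behavior).
rng = np.random.default_rng(0)
N, K, D, A, T = 16, 4, 8, 2, 10
inputs = rng.normal(size=(N, D))
init_slots = rng.normal(size=(K, D))
wq, wk, wv = (0.1 * rng.normal(size=(D, D)) for _ in range(3))
w_edge = 0.1 * rng.normal(size=(2 * D, D))
w_node = 0.1 * rng.normal(size=(2 * D + A, D))

slots, attn = slot_attention(inputs, init_slots, wq, wk, wv)
preds = rollout(slots, rng.normal(size=(T, K, A)), w_edge, w_node)
```

In this sketch the softmax runs over the slot axis, so each input feature distributes its attention across slots; that competition is what encourages slots to bind to distinct objects. The rollout reuses the same transition for every step, which is how the 1-, 5-, and 10-step prediction horizons would be produced from a single-step model.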
