Learning Interactive World Model for Object-Centric Reinforcement Learning
Fan Feng, Phillip Lippe, Sara Magliacane
TL;DR
This work introduces FIOC-WM, a two-level factored, object-centric world model that jointly learns per-object static and dynamic attributes and sparse inter-object interactions from pixel inputs. A hierarchical policy then leverages the learned interaction primitives to solve long-horizon tasks, with a high-level planner sequencing interaction graphs and a low-level controller executing them via MPC or PPO. Empirical results across SpritesWorld, Fetch, Franka Kitchen, i-Gibson, and Libero show improved world-model fidelity, more accurate interaction discovery, and faster, more generalizable policy learning, particularly under attribute, compositional, and skill generalization. The approach highlights explicit, modular interaction modeling as a key inductive bias for robust control, while noting limitations related to pretrained object-centric features and simulation-based evaluation.
Abstract
Agents that understand objects and their interactions can learn policies that are more robust and transferable. However, most object-centric RL methods factor the state by individual objects while leaving interactions implicit. We introduce the Factored Interactive Object-Centric World Model (FIOC-WM), a unified framework that learns structured representations of both objects and their interactions within a world model. FIOC-WM captures environment dynamics with disentangled and modular representations of object interactions, improving sample efficiency and generalization for policy learning. Concretely, FIOC-WM first learns object-centric latents and an interaction structure directly from pixels, leveraging pre-trained vision encoders. The learned world model then decomposes tasks into composable interaction primitives, and a hierarchical policy is trained on top: a high-level policy selects the type and order of interactions, while a low-level policy executes them. On simulated robotic and embodied-AI benchmarks, FIOC-WM improves policy-learning sample efficiency and generalization over world-model baselines, indicating that explicit, modular interaction learning is crucial for robust control.
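The two-level control flow described in the abstract can be illustrated with a toy sketch. This is not the FIOC-WM implementation: every name here (`Interaction`, `ToyWorld`, `high_level_plan`, `run_episode`) is hypothetical, the world model is replaced by a stub that only records which interactions occurred, and ordering interactions with a prerequisite-based topological sort is an assumption made for illustration, not the paper's planner.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Hypothetical type: an interaction primitive is (acting object, affected object).
Interaction = Tuple[str, str]


@dataclass
class ToyWorld:
    """Stub standing in for a learned world model (assumption, not FIOC-WM)."""
    done: Set[Interaction] = field(default_factory=set)

    def step(self, interaction: Interaction) -> bool:
        # A real low-level controller would run MPC or PPO to realize the
        # interaction; here we only record that it happened.
        self.done.add(interaction)
        return True


def high_level_plan(
    goals: List[Interaction],
    prerequisites: Dict[Interaction, List[Interaction]],
) -> List[Interaction]:
    """Order interactions so each one comes after its prerequisites.

    Uses a depth-first topological sort; assumes the prerequisite
    relation has no cycles.
    """
    ordered: List[Interaction] = []
    visited: Set[Interaction] = set()

    def visit(node: Interaction) -> None:
        if node in visited:
            return
        visited.add(node)
        for dep in prerequisites.get(node, []):
            visit(dep)
        ordered.append(node)

    for goal in goals:
        visit(goal)
    return ordered


def run_episode(world: ToyWorld, plan: List[Interaction]) -> List[Interaction]:
    """Low level: execute each planned interaction in sequence."""
    executed: List[Interaction] = []
    for interaction in plan:
        if world.step(interaction):
            executed.append(interaction)
    return executed


# Usage: pressing a button with a cube first requires grasping the cube.
example_prereqs = {("cube", "button"): [("gripper", "cube")]}
example_plan = high_level_plan([("cube", "button")], example_prereqs)
example_trace = run_episode(ToyWorld(), example_plan)
```

The point of the sketch is the separation of concerns: the high level reasons only over which objects interact and in what order, while the low level is responsible for realizing each individual interaction.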
