Learning Interactive World Model for Object-Centric Reinforcement Learning

Fan Feng, Phillip Lippe, Sara Magliacane

TL;DR

This work introduces FIOC-WM, a two-level factored, object-centric world model that jointly learns per-object static and dynamic attributes and sparse inter-object interactions from pixel inputs. A hierarchical policy then leverages the learned interaction primitives to solve long-horizon tasks, with a high-level planner sequencing interaction graphs and a low-level controller executing them via MPC or PPO. Empirical results across SpritesWorld, Fetch, Franka Kitchen, i-Gibson, and Libero show improved world-model fidelity, more accurate interaction discovery, and faster, more generalizable policy learning, particularly under attribute, compositional, and skill generalization. The approach highlights explicit, modular interaction modeling as a key inductive bias for robust control, while noting limitations related to pretrained object-centric features and simulation-based evaluation.

Abstract

Agents that understand objects and their interactions can learn policies that are more robust and transferable. However, most object-centric RL methods factor state by individual objects while leaving interactions implicit. We introduce the Factored Interactive Object-Centric World Model (FIOC-WM), a unified framework that learns structured representations of both objects and their interactions within a world model. FIOC-WM captures environment dynamics with disentangled and modular representations of object interactions, improving sample efficiency and generalization for policy learning. Concretely, FIOC-WM first learns object-centric latents and an interaction structure directly from pixels, leveraging pre-trained vision encoders. The learned world model then decomposes tasks into composable interaction primitives, and a hierarchical policy is trained on top: a high level selects the type and order of interactions, while a low level executes them. On simulated robotic and embodied-AI benchmarks, FIOC-WM improves policy-learning sample efficiency and generalization over world-model baselines, indicating that explicit, modular interaction learning is crucial for robust control.
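
To make the idea of factored per-object dynamics with a sparse interaction structure concrete, the following is a minimal sketch and not the authors' implementation. It assumes each object has a latent state vector and that a binary matrix marks which objects currently affect which others. The names `factored_step`, `self_fn`, `pair_fn`, and the toy dynamics are all hypothetical and chosen only for illustration; in FIOC-WM the corresponding components are learned from pixels.

```python
import numpy as np

def factored_step(states, graph, self_fn, pair_fn):
    """One transition step with per-object dynamics and sparse interactions.

    states: (N, D) array, one latent vector per object.
    graph:  (N, N) binary array; graph[i, j] == 1 means object j affects object i.
    self_fn, pair_fn: hypothetical stand-ins for learned dynamics modules.
    """
    n = states.shape[0]
    next_states = np.empty_like(states)
    for i in range(n):
        # Object i evolves on its own ...
        update = self_fn(states[i])
        # ... plus a contribution from each object it currently interacts with.
        for j in range(n):
            if i != j and graph[i, j]:
                update = update + pair_fn(states[i], states[j])
        next_states[i] = update
    return next_states

# Toy dynamics (assumptions for illustration only).
self_fn = lambda s: s                         # an isolated object stays put
pair_fn = lambda si, sj: 0.5 * (sj - si)      # pulled halfway toward the other object

states = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]])
graph = np.array([[0, 1, 0],                  # only object 1 affects object 0
                  [0, 0, 0],
                  [0, 0, 0]])
next_states = factored_step(states, graph, self_fn, pair_fn)
```

With this graph, only object 0 changes, because it is the only object with an incoming interaction edge. Keeping the graph explicit is what would let a high-level policy reason about which interactions to trigger and in what order.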

Paper Structure

This paper contains 54 sections, 15 equations, 10 figures, 12 tables, 1 algorithm.

Figures (10)

  • Figure 1: The overall pipeline, comprising the offline model learning phase (left) and the online policy learning phase (right). The illustrative examples are from the Franka Kitchen environment (Gupta et al., 2019).
  • Figure 2: An example of a FIOC-POMDP, where we only show the reward for $t+2$ for clarity. Gray nodes are observed variables, while white nodes are latent variables. Each orange box represents the state of an object (in this case, objects $i$ and $j$). Red solid edges are the state transition per object, and dashed edges are the interactions among objects.
  • Figure 3: The Offline Model Learning pipeline (Stage 1), which jointly learns the observation function, state factorization, dynamics model, and reward model. Although Fig. 1 includes low-level policy learning as part of Stage 1, we defer its discussion to Stage 2 for clarity.
  • Figure 4: Visualization of the evaluated benchmarks: (a) Sprites-World; (b) OpenAI Gym Fetch; (c) Franka Kitchen; (d) i-Gibson; and (e) Libero. A larger version is provided in the appendix.
  • Figure 5: (a) Evaluation of state factorization in Sprites-World. We report the MSE of linear probes against ground-truth attributes to assess the quality of the learned representations. (b) Offline RL performance (success rate) compared with DINO-WM, including FIOC-DINO and FIOC-R3M.
  • ...and 5 more figures