Relational Object-Centric Actor-Critic

Leonid Ugadiarov, Vitaliy Vorobyov, Aleksandr I. Panov

TL;DR

This work introduces Relational Object-Centric Actor-Critic (ROCA), an off-policy, value-based model-based RL algorithm that embeds a graph-based, object-centric world model inside the critic. ROCA uses a pre-trained SLATE object encoder to produce slot representations $z_t$ that feed a graph neural network for transition, reward, and value prediction, enabling planning-like reasoning within an SAC-style framework. Empirical results on 3D CausalWorld Object Reaching and 2D Shapes2D tasks show that ROCA achieves superior sample efficiency and performance in challenging multi-object scenarios, outperforming object-centric model-free baselines and DreamerV3 variants; ROCA-CSWM highlights challenges when integrating contrastive world-model training online. The work demonstrates that graph-based object-centric dynamics can effectively support policy learning, with limitations including deterministic dynamics and entropy sensitivity, and points to future work replacing SLATE with more powerful slot-based encoders to handle more visually complex environments.

Abstract

The advances in unsupervised object-centric representation learning have significantly improved its application to downstream tasks. Recent works highlight that disentangled object representations can aid policy learning in image-based, object-centric reinforcement learning tasks. This paper proposes a novel object-centric reinforcement learning algorithm that integrates actor-critic and model-based approaches by incorporating an object-centric world model within the critic. The world model captures the environment's data-generating process by predicting the next state and reward given the current state-action pair, where actions are interventions in the environment. In model-based reinforcement learning, world model learning can be interpreted as a causal induction problem, where the agent must learn the causal relationships underlying the environment's dynamics. We evaluate our method in a simulated 3D robotic environment and a 2D environment with compositional structure. As baselines, we compare against object-centric, model-free actor-critic algorithms and a state-of-the-art monolithic model-based algorithm. While the baselines show comparable performance in easier tasks, our approach outperforms them in more challenging scenarios with a large number of objects or more complex dynamics.

Paper Structure (44 sections, 15 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: A high-level overview of the proposed method. ROCA learns the policy by extracting object-centric representations from the source image and treating them as a complete graph.
  • Figure 2: ROCA overview. Framework consists of a pre-trained frozen SLATE model, which extracts object-centric representations from an image-based observation, and GNN-based modules: a transition model, a reward model, a state-value model, and an actor model. The transition and reward models form a world model. The world model and the state-value model together constitute the critic module, which predicts Q-values.
  • Figure 3: Return averaged over 30 episodes and three seeds for ROCA, ROCA-CSWM, DreamerV3, DreamerV3:pretrained, DreamerV3:slate, OCRL, OC-CA, OC-SA, and OCARL. ROCA learns faster or achieves higher metrics than the baselines. Shaded areas indicate standard deviation.
  • Figure 4: Return averaged over 30 episodes and three seeds for ROCA, ROCA-CSWM, DreamerV3, DreamerV3:slate, OCRL, OC-SA, OC-CA, and OCARL models in the Navigation 10x10 task. ROCA exhibits better performance than baselines but still does not solve the task. Shaded areas indicate standard deviation.
  • Figure 5: Ablation study. SAC-CNN: a version of SAC with a standard CNN encoder. SAC-SLATE: a version of SAC with a pretrained SLATE encoder that averages object embeddings to obtain the embedding of the current state. SAC-WM-SLATE: a modification of SAC-SLATE that uses a monolithic world model in its critic. SAC-GNN-SLATE: an object-centric version of SAC with a pretrained SLATE encoder that uses GNNs as actor and critic. ROCA (no-tuning): a version of ROCA without target entropy tuning. ROCA (no action in edge): a version of ROCA in which edge functions do not take an action as input. ROCA (no action in node): a version of ROCA in which node functions do not take an action as input. ROCA and ROCA (no action in node) demonstrate similar performance. ROCA outperforms the other considered baselines. Return is averaged over 30 episodes and three seeds. Shaded areas indicate standard deviation.
  • ...and 7 more figures
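
The critic described in the Figure 2 caption (a transition model and a reward model forming a world model, combined with a state-value model to predict Q-values) can be illustrated with a small sketch. It is not the authors' implementation. It assumes a one-step composition Q(z, a) = R(z, a) + γ·V(T(z, a)), a scalar action, and toy `tanh` functions in place of learned networks. The names `message_passing`, `edge_fn`, `node_fn`, `critic_q`, and `GAMMA` are hypothetical, and the pre-trained SLATE encoder is replaced by hand-written slot vectors.

```python
import math

GAMMA = 0.99  # discount factor (assumed value, not taken from the paper)


def edge_fn(z_i, z_j, action):
    # Toy stand-in for a learned edge MLP; takes the action as input.
    return [math.tanh(b - a + 0.1 * action) for a, b in zip(z_i, z_j)]


def node_fn(z_i, action, agg):
    # Toy stand-in for a learned node MLP; takes the action as input.
    return [math.tanh(z + 0.1 * m + 0.1 * action) for z, m in zip(z_i, agg)]


def message_passing(slots, action):
    """One round of message passing over a complete graph of object slots."""
    updated = []
    for i, z_i in enumerate(slots):
        # Sum the messages from every other slot (no self-loops).
        agg = [0.0] * len(z_i)
        for j, z_j in enumerate(slots):
            if i != j:
                msg = edge_fn(z_i, z_j, action)
                agg = [a + m for a, m in zip(agg, msg)]
        updated.append(node_fn(z_i, action, agg))
    return updated


def transition_model(slots, action):
    """Predict next slots as the current slots plus a small per-object delta."""
    deltas = message_passing(slots, action)
    return [[z + 0.1 * d for z, d in zip(z_i, d_i)]
            for z_i, d_i in zip(slots, deltas)]


def reward_model(slots, action):
    """Predict a scalar reward by pooling node outputs over all slots."""
    nodes = message_passing(slots, action)
    return sum(sum(n) for n in nodes) / len(slots)


def state_value_model(slots):
    """Predict a scalar state value; it does not depend on the action."""
    nodes = message_passing(slots, 0.0)
    return sum(sum(n) for n in nodes) / len(slots)


def critic_q(slots, action):
    """Assumed composition: Q(z, a) = R(z, a) + GAMMA * V(T(z, a))."""
    next_slots = transition_model(slots, action)
    return reward_model(slots, action) + GAMMA * state_value_model(next_slots)


if __name__ == "__main__":
    # Three hand-written 2-dimensional slots standing in for encoder output.
    slots = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.0]]
    print(critic_q(slots, action=0.5))
```

Because messages are summed over all other slots and the reward and value heads pool over slots, the Q-value in this sketch does not depend on the order in which the slots are listed, up to floating-point rounding.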