COBRA: Data-Efficient Model-Based RL through Unsupervised Object Discovery and Curiosity-Driven Exploration

Nicholas Watters, Loic Matthey, Matko Bosnjak, Christopher P. Burgess, Alexander Lerchner

TL;DR

COBRA tackles data efficiency and robustness in continuous control by combining unsupervised object-centric representation learning, curiosity-driven exploration, and model-based RL in a two-phase pipeline. It learns object slots and dynamics without rewards during exploration, then freezes these components and uses a reward predictor for 1-step model-based planning on downstream tasks. The approach yields strong data efficiency and robustness to task-irrelevant perturbations in Spriteworld, outperforming model-free baselines and demonstrating amortization of pretraining across tasks. This work suggests that structured, object-centric world models plus intrinsic curiosity can enable scalable, robust transfer to diverse control tasks.

Abstract

Data efficiency and robustness to task-irrelevant perturbations are long-standing challenges for deep reinforcement learning algorithms. Here we introduce a modular approach to addressing these challenges in a continuous control environment, without using hand-crafted or supervised information. Our Curious Object-Based seaRch Agent (COBRA) uses task-free intrinsically motivated exploration and unsupervised learning to build object-based models of its environment and action space. Subsequently, it can learn a variety of tasks through model-based search in very few steps and excel on structured hold-out tests of policy robustness.
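The 1-step model-based search mentioned in the TL;DR can be illustrated with a minimal sketch: sample candidate actions, predict the next object slots with the frozen transition model, score each prediction with the reward predictor, and act on the best candidate. Everything named below is an assumption made for illustration and is not taken from the paper's code: the function `one_step_search`, the number of candidates, and the toy stand-ins `toy_transition` and `toy_reward`, which replace the learned networks with hand-written rules over 2D positions. The 4-dimensional action layout (position click, then motion click) follows the Figure 2 caption.

```python
import random
from typing import Callable, List, Tuple

Slots = List[Tuple[float, float]]           # toy slot: (x, y) position of one object
Action = Tuple[float, float, float, float]  # (click_x, click_y, motion_x, motion_y)


def one_step_search(
    slots: Slots,
    sample_action: Callable[[], Action],
    transition: Callable[[Slots, Action], Slots],
    reward: Callable[[Slots], float],
    num_candidates: int = 64,
) -> Tuple[Action, float]:
    """Return the sampled action whose predicted next state scores highest."""
    best_action, best_value = None, float("-inf")
    for _ in range(num_candidates):
        action = sample_action()
        predicted = transition(slots, action)  # one-step prediction only
        value = reward(predicted)              # predicted reward of that state
        if value > best_value:
            best_action, best_value = action, value
    return best_action, best_value


def toy_transition(slots: Slots, action: Action) -> Slots:
    """Hypothetical stand-in: the object nearest the click moves by (motion - 0.5)."""
    cx, cy, mx, my = action
    nearest = min(
        range(len(slots)),
        key=lambda i: (slots[i][0] - cx) ** 2 + (slots[i][1] - cy) ** 2,
    )
    moved = list(slots)  # copy so the input scene is not mutated
    x, y = slots[nearest]
    moved[nearest] = (x + mx - 0.5, y + my - 0.5)
    return moved


def toy_reward(slots: Slots, goal: Tuple[float, float] = (0.5, 0.5)) -> float:
    """Hypothetical stand-in: negative summed distance of all objects to one goal."""
    return -sum(((x - goal[0]) ** 2 + (y - goal[1]) ** 2) ** 0.5 for x, y in slots)


if __name__ == "__main__":
    rng = random.Random(0)
    scene = [(0.1, 0.1), (0.9, 0.8)]
    sampler = lambda: tuple(rng.random() for _ in range(4))
    print(one_step_search(scene, sampler, toy_transition, toy_reward))
```

In this sketch the sampler is uniform; the summary above only states that the exploration-phase components are frozen, so how COBRA proposes candidate actions at task time is left open here.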

Paper Structure

This paper contains 29 sections, 3 equations, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: COBRA model schematic. A. Entire model. The vision module (scene encoder and decoder), transition model, and exploration policy are all trained in a pure exploration phase with no reward. B. Transition model architecture. An action-conditioned slot-wise MLP learns one-step future prediction. It is trained by applying the scene decoder to $\tilde{\boldsymbol{z}}_{t+1}$, through which gradients from a pixel loss are passed. An auxiliary transition error prediction provides a more direct path to the pixel loss and makes adversarial training with the exploration policy more efficient. C. Adversarial training of the transition model and exploration policy, through which the behavior of moving objects emerges.
  • Figure 2: Scene decomposition, transition model, and exploration policy results. A. Vision module (MONet) decomposing Spriteworld scenes into objects (one column per sample scene). (First row) Data samples. (Second row) Reconstruction of the full scene. (Other rows) Individual reconstructions of each slot in the learned scene representation. Some slots are decoded as blank images by the decoder. B. Rollouts of the transition model, treated as an RNN, compared to ground truth on two scenes. In each scene, a single item (indicated by a dotted circle) is moved along a line in multiple steps. C. Exploration policy. (Left) Position click component of random samples from the trained exploration policy, which learns to click on (and hence move) objects. (Middle) Slice through the first two dimensions (position click) of the action sampler's quantile function, showing deformations applied to a grid of first clicks $\in [0, 1]^2$ with a randomized second click. (Right) Slice through the second two dimensions (motion click). There is virtually no deformation, indicating that the exploration policy learns to sample motions randomly.
  • Figure 3: Random policy and exploration policy. Observations and actions taken by an agent during the unsupervised exploratory phase. Actions are shown with small green arrows. (Top) A random agent rarely moves any object and so provides a poor source of data for the transition model. (Bottom) The trained exploration policy frequently moves objects and so provides a good source of data for the transition model.
  • Figure 4: Performance, Robustness, and Data Efficiency. A. Performance and robustness of agents after training until convergence. The top row shows test-time performance of agents on random environments sampled from the training distribution (higher is better). The bottom row shows robustness tests on out-of-distribution, task-irrelevant environment perturbations (see main text for details). B. Data efficiency (lower is better), computed as the smallest number of on-task environment steps needed to reach and sustain $90\%$ average performance over 30 consecutive episodes. The corresponding number of episodes varies depending on task and agent performance, but for COBRA it ranges from $\sim 15$ (Goal finding new shapes) to $\sim 600$ (sorting). Gray bars indicate that no agent reached $90\%$ performance.
  • Figure 5: COBRA solving our tasks. Demonstration of a trained COBRA agent solving different tasks and its behavior on robustness tests. Agent actions are shown with white arrows, and target goals are shown with crossed circles. These targets are only shown for visualization purposes and are not provided to the agent. See Appendix \ref{S:supp_agent_videos} for links to videos. Only five steps are displayed for each episode. (Top) Solving a "Goal finding" task. Having been trained only with a single distractor, COBRA is robust to the addition of a second distractor at test time. (Middle) Solving a "Sorting by color" task. COBRA has been trained to bring objects to different goals depending on their colors, seeing all pairs of colors except (blue, red). It is robust to testing on this held-out combination of objects and successfully brings them to their targets. (Bottom) Solving a "Cluster by color" task. Having been trained only on clustering green/blue objects, COBRA successfully extrapolates its reward predictor, and hence its policy, to clustering red/yellow objects.
  • ...and 6 more figures
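
The data-efficiency measure in the Figure 4 caption (on-task environment steps needed to reach and sustain $90\%$ average performance over 30 consecutive episodes) can be sketched as a sliding-window computation. This is one possible reading of the caption and not the authors' evaluation code: it assumes per-episode scores normalized to $[0, 1]$, takes the first window of 30 consecutive episodes whose mean score meets the threshold, and counts all steps up to the end of that window. The function name `steps_to_threshold` and its parameters are hypothetical.

```python
from typing import Optional, Sequence


def steps_to_threshold(
    episode_steps: Sequence[int],
    episode_scores: Sequence[float],
    threshold: float = 0.9,
    window: int = 30,
) -> Optional[int]:
    """Cumulative steps at the end of the first qualifying window, else None."""
    if len(episode_steps) != len(episode_scores):
        raise ValueError("episode_steps and episode_scores must have equal length")
    total_steps = 0
    for i, steps in enumerate(episode_steps):
        total_steps += steps
        if i + 1 < window:
            continue  # not enough episodes yet to fill one window
        recent = episode_scores[i + 1 - window : i + 1]
        if sum(recent) / window >= threshold:
            return total_steps
    return None  # threshold never reached (a gray bar in Figure 4B)


if __name__ == "__main__":
    # Toy log: 40 failed episodes followed by 30 successes, 10 steps each.
    steps = [10] * 70
    scores = [0.0] * 40 + [1.0] * 30
    print(steps_to_threshold(steps, scores))
```

Returning `None` when no window qualifies mirrors the caption's gray bars, where no agent reached $90\%$ performance.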