Deep Reinforcement Learning via Object-Centric Attention

Jannis Blüml, Cedric Derstroff, Bjarne Gregori, Elisabeth Dillies, Quentin Delfosse, Kristian Kersting

TL;DR

The paper tackles the limited generalization of deep RL agents trained on raw pixels by introducing Object-Centric Attention via Masking (OCCAM), which masks background pixels to preserve task-relevant objects. It leverages simple object bounding boxes to create four abstraction levels (Object, Binary, Class, Planes) and evaluates them on Atari with perturbations to test robustness. Empirical results show OCCAM can match or exceed pixel-based PPO performance and substantially improve resilience to visual perturbations, without requiring domain-specific object pipelines or symbolic reasoning. The findings suggest that structured, object-centric representations can enhance generalization and sample efficiency in RL, offering a practical alternative to fully symbolic or heavily preprocessed approaches.

Abstract

Deep reinforcement learning agents, trained on raw pixel inputs, often fail to generalize beyond their training environments, relying on spurious correlations and irrelevant background details. To address this issue, object-centric agents have recently emerged. However, they require different representations tailored to the task specifications. Contrary to deep agents, no single object-centric architecture can be applied to any environment. Inspired by principles of cognitive science and Occam's Razor, we introduce Object-Centric Attention via Masking (OCCAM), which selectively preserves task-relevant entities while filtering out irrelevant visual information. Specifically, OCCAM takes advantage of the object-centric inductive bias. Empirical evaluations on Atari benchmarks demonstrate that OCCAM significantly improves robustness to novel perturbations and reduces sample complexity while showing similar or improved performance compared to conventional pixel-based RL. These results suggest that structured abstraction can enhance generalization without requiring explicit symbolic representations or domain-specific object extraction pipelines.
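As a rough illustration of the masking idea, the following sketch builds the four abstraction levels named in the TL;DR (Object, Binary, Class, Planes) from a grayscale frame and a list of bounding boxes. It is a minimal sketch, not the authors' implementation: the function name `occam_masks`, the `(x, y, width, height, class_id)` box format, and the way class gray levels are assigned are all assumptions made for this example.

```python
import numpy as np

def occam_masks(frame, boxes, num_classes):
    """Build four masked views of a grayscale frame (hypothetical helper).

    frame: (H, W) uint8 array.
    boxes: iterable of (x, y, width, height, class_id) tuples (assumed format).
    """
    h, w = frame.shape
    object_mask = np.zeros_like(frame)   # original pixels inside boxes, background zeroed
    binary_mask = np.zeros_like(frame)   # boxes filled with white
    class_mask = np.zeros_like(frame)    # boxes filled with one gray level per class
    planes = np.zeros((num_classes, h, w), dtype=frame.dtype)  # one plane per class

    for x, y, bw, bh, cls in boxes:
        region = (slice(y, y + bh), slice(x, x + bw))
        object_mask[region] = frame[region]
        binary_mask[region] = 255
        # Assumed encoding: evenly spaced gray levels, never 0 (the background).
        class_mask[region] = (cls + 1) * 255 // num_classes
        planes[cls][region] = 255

    return object_mask, binary_mask, class_mask, planes


# Toy 8x8 frame with two objects of different classes.
frame = np.arange(64, dtype=np.uint8).reshape(8, 8)
boxes = [(1, 1, 2, 2, 0), (5, 4, 2, 3, 1)]
obj, binary, cls_mask, planes = occam_masks(frame, boxes, num_classes=2)
```

In this sketch every pixel outside a box is zero in all four views, so background changes cannot reach the agent through the mask. Whether the agent also sees object appearance depends on which view is used.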

Paper Structure

This paper contains 37 sections, 2 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Deep agents cannot generalize to simpler scenarios. Testing deep agents in simpler or slightly changed versions of the environments reveals a significant performance drop, highlighting a critical limitation in their robustness and adaptability. The reported numbers are the interquartile mean (IQM) over 3 seeds, with 10 runs each on 11 games (C51 on six games). The exact performances are in \ref{app:results}.
  • Figure 2: Object-Centric Attention via Masking. An object detection method detects moving objects, allowing it to mask out the irrelevant background details. Optionally, objects can be classified to obtain a class-augmented mask. This representation is then passed to the CNN-based agent.
  • Figure 3: The different extracted representations compared in this paper. For a given frame (left), (a) gray scaling and resizing are applied to reduce computational complexity. On top of it, (b) uses an object extractor to mask out the background information, (c) whitens the bounding boxes, (d) relies on a classifier to assign each object box a given class color, while (e) separates the masks in different planes for each class. Agents are provided with stacks of the latest $4$ extracted representations, except for the semantic vector (f), which extracts symbolic properties of the depicted objects from the last $2$ frames. More examples are provided in \ref{app:masks}.
  • Figure 4: OCCAM-based representations match or surpass pixel inputs, showing that abstraction improves performance. This figure compares PPO agents using different input types. OCCAM preserves effectiveness despite filtering details, proving structured inputs can replace raw pixels. Full results are in \ref{app:results}.
  • Figure 5: OCCAM-based representations improve robustness to visual perturbations but remain vulnerable to game logic changes. This figure illustrates the relative performance (game normalized score) of PPO agents utilizing different input representations under both (a) visual and (b) game logic perturbations. While OCCAM-based representations significantly mitigate performance degradation due to visual modifications, they remain sensitive to changes in game mechanics, highlighting the limits of abstraction in RL. The game-specific scores are in \ref{app:results}.
  • ...and 7 more figures
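
The interquartile mean (IQM) reported in Figure 1 can be sketched as follows. This is a minimal sketch assuming the common trimmed-mean convention: sort all runs, discard the lowest and highest 25%, and average the rest. The function name `interquartile_mean` and the example scores are illustrative and not taken from the paper, and how the paper pools seeds, runs, and games before taking the IQM is not stated here.

```python
import numpy as np

def interquartile_mean(scores):
    """Mean of the middle 50% of the scores (hypothetical helper).

    Drops floor(n / 4) values from each end of the sorted scores,
    then averages what remains.
    """
    s = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = s.size // 4
    return s[cut:s.size - cut].mean()


# Made-up scores: one outlier run barely affects the IQM.
scores = [1, 2, 3, 4, 5, 6, 7, 100]
iqm = interquartile_mean(scores)   # averages 3, 4, 5, 6
plain_mean = float(np.mean(scores))
```

Compared with the plain mean, the IQM is less sensitive to a single unusually good or bad run, which matters when only a few seeds are available.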