Deep Reinforcement Learning Agents are not even close to Human Intelligence

Quentin Delfosse, Jannis Blüml, Fabian Tatai, Théo Vincent, Bjarne Gregori, Elisabeth Dillies, Jan Peters, Constantin Rothkopf, Kristian Kersting

TL;DR

This paper argues that current deep and symbolic RL agents struggle to generalize to simplified task variations, revealing a persistent reliance on shortcuts rather than truly understanding task structure. It introduces HackAtari, a RAM-based variation suite for the Arcade Learning Environment, to systematically test human-like generalization and detect misalignment. Empirical results show broad performance drops across diverse agents on HackAtari variations, while humans maintain or improve performance, underscoring the gap to human-like intelligence. The work advocates for benchmarks that stress relational reasoning and the incorporation of human inductive biases to drive the development of more robust, aligned RL systems with practical impact.

Abstract

Deep reinforcement learning (RL) agents achieve impressive results in a wide variety of tasks, but they lack zero-shot adaptation capabilities. While most robustness evaluations focus on task complexifications, on which humans also struggle to maintain performance, no evaluation has been performed on task simplifications. To tackle this issue, we introduce HackAtari, a set of task variations of the Arcade Learning Environment. We use it to demonstrate that, contrary to humans, RL agents systematically exhibit huge performance drops on simpler versions of their training tasks, uncovering agents' consistent reliance on shortcuts. Our analysis across multiple algorithms and architectures highlights the persistent gap between RL agents and human behavioral intelligence, underscoring the need for new benchmarks and methodologies that enforce systematic generalization testing beyond static evaluation protocols. Training and testing in the same environment is not enough to obtain agents equipped with human-like intelligence.

Paper Structure

This paper contains 67 sections, 2 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: RAM alteration allows for modified environments, here exemplified on Pong. Altering specific RAM cells leads to an enemy remaining static after it returned the ball.
  • Figure 2: Examples of HackAtari simple task variations. Top: the original Atari games used to train RL agents. Bottom: simplifications (i.e. variations for which human performance does not drop). These include color changes and gameplay shifts. Superposed frames show the game dynamics. Descriptions of more environments and their variations are provided in Appendix \ref{appendix:environments}.
  • Figure 3: The performance of deep and symbolic RL agents drops on HackAtari variations, illustrated by the IQM (following reliable agarwal2021deep) over the human normalized scores (HNS) of various RL agents on a total set of $32$ task variations (over $17$ games). IQMs are computed over $3$ seeded trained agents ($30$ evaluations each). Expert-human scores are borrowed from Badia2020agent57. Performance in the original environment is plotted filled, while performance in the modified environment is plotted hatched. Raw IQM scores (with CIs) for each agent on each game (original and variations) and extended results are provided in Appendix \ref{appendix:Q1} and \ref{appendix:Q3}.
  • Figure 4: While humans easily adapt to task simplifications, deep agents' performance drops, illustrated on $15$ ALE games. Non-expert users and deep RL agents are trained and evaluated on the original ALE environment, then presented with a variation of the task. Left: Variations considered as task simplifications by design. Right: Variations for which little or no performance increase is expected. Games for which no C51 agent is publicly available are marked with $\boldsymbol{\times}$. For the exact performance of humans and deep agents, cf. Appendix \ref{appendix:Q1} and \ref{appendix:Q2}.
  • Figure 5: Object-centric RL agents also fail to adapt to simplified environments. Different object-centric approaches (all using PPO) are here compared to the classical CNN baseline on the same $15$ variations (as Figure \ref{fig:deep_agents_pc}). Visual perturbations (e.g. color change, left) have a very limited impact on the symbolic agents, while most gameplay modifications (right) still cannot be solved by these agents. Extended results are available in Appendix \ref{appendix:Q3}.
  • ...and 9 more figures
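
The Figure 3 caption reports the interquartile mean (IQM) of human normalized scores (HNS). As a rough illustration of how these two quantities combine, here is a minimal sketch. It assumes the standard definitions: HNS rescales a raw score so that a random policy maps to 0 and the human reference maps to 1, and IQM is the mean of the middle 50% of the sorted scores. The function names and all numbers below are hypothetical and are not taken from the paper; the paper's own evaluation code may differ in details such as how ties and fractional quartiles are handled.

```python
import numpy as np


def human_normalized_score(agent_score, random_score, human_score):
    """HNS = (agent - random) / (human - random); 0 = random policy, 1 = human reference."""
    agent_score = np.asarray(agent_score, dtype=float)
    return (agent_score - random_score) / (human_score - random_score)


def interquartile_mean(scores):
    """Mean of the middle 50% of the sorted scores (a 25% trimmed mean).

    Simplifying assumption: the number of values cut from each end is
    floor(n / 4), so no fractional weighting of boundary values is applied.
    """
    x = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = x.size // 4
    return x[cut:x.size - cut].mean()


# Hypothetical raw episode scores for one agent on one game (not from the paper).
RANDOM_SCORE, HUMAN_SCORE = 2.0, 12.0
original_runs = [12, 14, 10, 13, 11, 15, 9, 12]
variation_runs = [3, 2, 4, 12, 2, 3, 5, 2]

iqm_original = interquartile_mean(
    human_normalized_score(original_runs, RANDOM_SCORE, HUMAN_SCORE)
)
iqm_variation = interquartile_mean(
    human_normalized_score(variation_runs, RANDOM_SCORE, HUMAN_SCORE)
)

print(f"IQM of HNS, original:  {iqm_original:.2f}")   # approximately 1.00
print(f"IQM of HNS, variation: {iqm_variation:.2f}")  # approximately 0.10
```

In this made-up example, the single lucky run on the variation (raw score 12) falls outside the middle 50% and does not affect the IQM, which is why the statistic is less sensitive to outlier runs than a plain mean.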