Surprise-Adaptive Intrinsic Motivation for Unsupervised Reinforcement Learning

Adriana Hugessen, Roger Creus Castanyer, Faisal Mohamed, Glen Berseth

TL;DR

The paper addresses the limitation that fixed entropy-based intrinsic motivations (surprise-minimization or surprise-maximization) fail to generalize across environments. It introduces a Surprise-Adaptive Bandit (S-Adapt) that selects between these two objectives online by measuring the agent's ability to control environmental entropy, using an intrinsic feedback signal grounded in entropy dynamics. Experiments show that S-Adapt replicates the favorable behaviors of each single-objective agent in the appropriate regime, produces diverse emergent behaviors in benchmarks, and achieves competitive or superior task rewards without extrinsic supervision. The approach provides a versatile framework for unsupervised reinforcement learning that adapts to the entropy landscape of the environment, with potential implications for scalable pretraining and continual learning.

Abstract

Both entropy-minimizing and entropy-maximizing (curiosity) objectives for unsupervised reinforcement learning (RL) have been shown to be effective in different environments, depending on the environment's level of natural entropy. However, neither method alone results in an agent that will consistently learn intelligent behavior across environments. In an effort to find a single entropy-based method that will encourage emergent behaviors in any environment, we propose an agent that can adapt its objective online, depending on the entropy conditions, by framing the choice as a multi-armed bandit problem. We devise a novel intrinsic feedback signal for the bandit, which captures the agent's ability to control the entropy in its environment. We demonstrate that such agents can learn to control entropy and exhibit emergent behaviors in both high- and low-entropy regimes and can learn skillful behaviors in benchmark tasks. Videos of the trained agents and summarized findings can be found on our project page: https://sites.google.com/view/surprise-adaptive-agents
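To make the bandit framing concrete, the following is a minimal sketch, not the authors' implementation. It assumes a two-armed UCB1 bandit (the abstract does not name the bandit algorithm) and assumes the feedback is the relative absolute change in entropy against a random-agent baseline. The names `SurpriseBandit`, `entropy_control_feedback`, and the arm labels are hypothetical.

```python
import math


def entropy_control_feedback(agent_entropy, random_entropy):
    """Assumed feedback: how far the agent moved entropy away from a
    random-agent baseline, in either direction, relative to that baseline."""
    return abs(agent_entropy - random_entropy) / max(abs(random_entropy), 1e-8)


class SurpriseBandit:
    """Two-armed UCB1 bandit over intrinsic objectives (illustrative only)."""

    ARMS = ("s_min", "s_max")  # hypothetical labels for the two objectives

    def __init__(self, exploration=math.sqrt(2.0)):
        self.exploration = exploration
        self.counts = [0] * len(self.ARMS)    # pulls per arm
        self.values = [0.0] * len(self.ARMS)  # running mean feedback per arm

    def select(self):
        # Try every arm once before applying the UCB rule.
        for arm, pulls in enumerate(self.counts):
            if pulls == 0:
                return arm
        total = sum(self.counts)
        scores = [
            value + self.exploration * math.sqrt(math.log(total) / pulls)
            for value, pulls in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, feedback):
        # Incremental mean of the feedback observed for this arm.
        self.counts[arm] += 1
        self.values[arm] += (feedback - self.values[arm]) / self.counts[arm]


if __name__ == "__main__":
    # Toy stand-in for episodes: minimizing surprise lowers entropy far below
    # the random baseline, maximizing raises it only slightly (made-up numbers).
    random_entropy = 1.0
    toy_entropy = {0: 0.2, 1: 1.1}
    bandit = SurpriseBandit()
    for _ in range(200):
        arm = bandit.select()
        bandit.update(arm, entropy_control_feedback(toy_entropy[arm], random_entropy))
    print(dict(zip(SurpriseBandit.ARMS, bandit.counts)))
```

In this toy setting the bandit concentrates its pulls on the arm whose objective changes entropy the most, which is the intended reading of "ability to control the entropy" under the assumptions stated above.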

Paper Structure (22 sections, 6 equations, 9 figures, 1 algorithm)

Figures (9)

  • Figure 1: The Butterflies (left) and Maze (right) environments. S-Min trains the agent to actively catch the butterflies in order to prevent diverse state configurations, but it also discourages the agent from navigating the Maze. S-Max trains the agent to navigate the Maze efficiently, but also to avoid catching butterflies. These two didactic environments show that current intrinsic objectives cannot adapt and fail to provide generally useful objectives for RL agents.
  • Figure 2: Average episode return (left) and surprise (right) versus environment interactions (mean over 5 seeds; shading shows one standard deviation) in the Maze environment. S-Max and S-Adapt are the only objectives that allow the RL agents to consistently find the goal in the maze. These two objectives also cause the largest change in surprise relative to the Random agent.
  • Figure 3: Average episode return (left) and surprise (right) versus environment interactions (mean over 5 seeds; shading shows one standard deviation) in the Butterflies environment. In the small grid, the S-Min, Extrinsic, and even the Random agents catch most of the butterflies. Because the grid is small, surprise-minimization and surprise-maximization are equally effective at entropy control, and hence the S-Adapt agent converges to S-Max. In the larger grid, however, the Random agent cannot catch many butterflies and hence has a high-entropy state distribution. As before, the S-Max agent learns to avoid catching butterflies and the S-Min agent learns to catch them, but in this larger grid catching butterflies produces a significant change in the state-marginal entropy. The S-Adapt agent identifies this and converges to S-Min, resulting in agents that catch more than half of the butterflies without access to the extrinsic reward.
  • Figure 4: Average episode return (left) and surprise (right) versus environment interactions (mean over 5 seeds; shading shows one standard deviation) in Tetris. The S-Min, S-Adapt, and Extrinsic agents solve the game (i.e., they consistently survive for more than 200 steps). Interestingly, the surprise-minimizing objective, which S-Adapt converges to, turns out to be a better learning signal than the row-clearing extrinsic reward in Tetris.
  • Figure 5: Average episode return (left) and surprise (right) versus environment interactions (mean over 5 seeds; shading shows one standard deviation) in the MinAtar suite of environments. In all environments, the S-Adapt agent is able to select the most controllable direction for entropy optimization, as demonstrated by the change in entropy from the beginning to the end of training. The S-Adapt agent demonstrates emergent behaviors in certain environments, such as Freeway, where it achieves rewards on par with those of the Extrinsic agent. However, in other environments, such as Seaquest, Space Invaders, and Asterix, the extrinsic reward is not closely correlated with entropy control: the Random agent and the Extrinsic agent achieve similar entropy.
  • ...and 4 more figures