Beyond Surprise: Improving Exploration Through Surprise Novelty

Hung Le; Kien Do; Dung Nguyen; Svetha Venkatesh

TL;DR

The paper introduces Surprise Memory (SM), a memory-augmented framework for intrinsic motivation that rewards the novelty of a surprise rather than its magnitude. SM takes the surprise produced by an existing surprise generator (SG), combines an episodic memory with an autoencoder to retrieve past surprise patterns, and uses the retrieval error as the intrinsic reward. The reward therefore stays focused on genuinely novel events, even in noisy or stochastic environments. Across the Noisy-TV, MiniGrid, and Atari benchmarks, SG+SM improves exploration efficiency and final performance, and ablations indicate that both memory components are needed. The approach works as a plug-in for existing surprise-based predictors and suggests a broader role for memory-based exploration in reinforcement learning.

Abstract

We present a new computing model for intrinsic rewards in reinforcement learning that addresses the limitations of existing surprise-driven explorations. The reward is the novelty of the surprise rather than the surprise norm. We estimate the surprise novelty as retrieval errors of a memory network wherein the memory stores and reconstructs surprises. Our surprise memory (SM) augments the capability of surprise-based intrinsic motivators, maintaining the agent's interest in exciting exploration while reducing unwanted attraction to unpredictable or noisy observations. Our experiments demonstrate that the SM combined with various surprise predictors exhibits efficient exploring behaviors and significantly boosts the final performance in sparse reward environments, including Noisy-TV, navigation and challenging Atari games.
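The difference between surprise norm and surprise novelty can be illustrated with a toy sketch. It assumes a nearest-neighbour lookup as a stand-in for the paper's learned memory network, and the names `norm` and `retrieval_error` are hypothetical, so it shows the idea only, not the authors' method: a surprise with a large norm earns a high reward the first time it appears and none once the memory can retrieve it.

```python
import math

def norm(u):
    """Euclidean norm of a surprise vector (the 'surprise norm')."""
    return math.sqrt(sum(x * x for x in u))

def retrieval_error(u, memory):
    """Toy surprise novelty: distance from u to its closest stored surprise.

    Assumption: nearest-neighbour lookup stands in for the learned memory
    network. An empty memory cannot retrieve anything, so the error falls
    back to the norm of u.
    """
    if not memory:
        return norm(u)
    return min(norm([a - b for a, b in zip(u, m)]) for m in memory)

memory = []
surprise = [3.0, 4.0]            # a large surprise, norm 5.0

first_visit = retrieval_error(surprise, memory)
memory.append(surprise)          # store the surprise once it has been rewarded
second_visit = retrieval_error(surprise, memory)

print(norm(surprise), first_visit, second_visit)  # 5.0 5.0 0.0
```

The surprise norm stays at 5.0 on both visits, whereas the toy novelty drops to 0.0 on the second one.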

Paper Structure (26 sections, 1 theorem, 19 equations, 12 figures, 5 tables, 2 algorithms)

Introduction
Methods
Surprise Novelty
Surprise Generator
Surprise Memory
Experimental Results
Noisy-TV: Robustness against Noisy Observations
MiniGrid: Compatibility with Different Surprise Generators
Atari: Sample-efficient Benchmark
Ablation Study
Related works
Discussion
$\mathcal{W}$ as Associative Memory
SM's Implementation Detail
Intrinsic Reward Normalization
...and 11 more sections

Key Result

Proposition 1

Let $X$ and $U$ be random variables representing the observation and surprise at the same timestep, respectively. Under an imperfect SG, the following inequality holds: where $\left(\sigma_{i}^{X}\right)^{2}$ and $\left(\sigma_{i}^{U}\right)^{2}$ denote the $i$-th diagonal elements of $\mathrm{var}(X)$ and $\mathrm{var}(U)$, respectively.

Figures (12)

Figure 1: Montezuma's Revenge: surprise novelty reflects the originality of the environment better than surprise norm. Because of unpredictability, surprise norm can be large even for dull events such as those in the dark room, whereas surprise novelty tends to be lower there ($3^{rd}$ and $6^{th}$ image). Conversely, surprise novelty can be higher in truly vivid states on the first visit to the ladder and island rooms ($1^{st}$ and $2^{nd}$ image) and drops on the second visit ($4^{th}$ and $5^{th}$ image). Surprise novelty and surprise norm are averaged over the steps taken in each room.
Figure 2: Surprise Generator+Surprise Memory (SG+SM). The SG takes input $I_{t}$ from the environment to estimate the surprise $u_{t}$ at state $s_{t}$. The SM consists of two modules: an episodic memory ($\mathcal{M}$) and an autoencoder network ($\mathcal{W}$). $\mathcal{M}$ is slot-based, storing past surprises within the episode. At timestep $t$, given surprise $u_{t}$, $\mathcal{M}$ retrieves read-out $u_{t}^{e}$ to form a query surprise $q_{t}=\left[u_{t}^{e},u_{t}\right]$ to $\mathcal{W}$. $\mathcal{W}$ reconstructs the query and takes the reconstruction error (surprise novelty) as the intrinsic reward $r_{t}^{i}$.
Figure 3: Noisy-TV: (a) mean-normalized intrinsic reward (MNIR) produced by RND and RND+SM at 7 selected steps in an episode. (b) Average task return (mean$\pm$std. over 5 runs) over 4 million training steps.
Figure 4: Key-Door: (a) Example map in Key-Door where the light window is the agent's view window (state). MNIR produced for each cell in a manually created trajectory for RND+SM (b) and RND (c). The green arrows denote the agent's direction at each location. The brighter the cell, the higher MNIR assigned to the corresponding state.
Figure 5: (a,b) Atari long runs over 200 million frames: average return over 128 episodes. (c) Ablation study on SM's components. (d) MiniWorld exploration without task reward: Cumulative task returns over 100 million training steps for the hard setting. The learning curves are mean$\pm$std. over 5 runs.
...and 7 more figures
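The data flow in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the softmax attention read over memory slots, the linear encoder and decoder with random untrained weights, and all function names are hypothetical choices made here, not details taken from the paper. Only the structure follows the caption: read $u_{t}^{e}$ from $\mathcal{M}$, form $q_{t}=\left[u_{t}^{e},u_{t}\right]$, reconstruct it with $\mathcal{W}$, and use the reconstruction error as $r_{t}^{i}$.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read_episodic(u, slots):
    """Read-out u_e from the slot memory M (assumed: dot-product attention)."""
    if not slots:
        return [0.0] * len(u)          # nothing stored yet in this episode
    weights = softmax([sum(a * b for a, b in zip(u, s)) for s in slots])
    return [sum(w * s[i] for w, s in zip(weights, slots)) for i in range(len(u))]

def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]

def intrinsic_reward(u, slots, enc, dec):
    """Reconstruction error of the query q = [u_e, u] under the autoencoder W."""
    q = read_episodic(u, slots) + u    # list concatenation forms [u_e, u]
    q_hat = matvec(dec, matvec(enc, q))
    return sum((a - b) ** 2 for a, b in zip(q, q_hat))

random.seed(0)
dim, hidden = 3, 2
# Assumed: untrained linear encoder/decoder, used only to show the data flow.
enc = [[random.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(hidden)]
dec = [[random.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(2 * dim)]

slots, rewards = [], []
for _ in range(3):                     # three timesteps of one episode
    u_t = [random.gauss(0.0, 1.0) for _ in range(dim)]   # surprise from the SG
    rewards.append(intrinsic_reward(u_t, slots, enc, dec))
    slots.append(u_t)                  # M stores surprises within the episode

print(rewards)
```

In a full agent the autoencoder would be trained on the queries so that familiar surprise patterns reconstruct well and earn little reward; the fixed random weights here only keep the sketch short.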

Theorems & Definitions (1)

Proposition 1
