Successor-Predecessor Intrinsic Exploration

Changmin Yu, Neil Burgess, Maneesh Sahani, Samuel J. Gershman

TL;DR

This work addresses exploration in reinforcement learning under sparse extrinsic rewards by introducing SPIE, a framework that fuses prospective information from successor representations with retrospective connectivity signals. SPIE provides two instantiations, SR-R for discrete spaces and SF-PF for continuous spaces, both aiming to bias exploration toward globally informative states such as bottlenecks. Empirical results across tabular, grid-world, MountainCar, and Atari environments show that SPIE improves sample efficiency and final performance, with notable gains on hard-exploration games like Montezuma's Revenge. The approach highlights the practical impact of leveraging trajectory structure for intrinsic motivation and opens avenues for theory and extensions to non-stationary and causal exploration settings.

Abstract

Exploration is essential in reinforcement learning, particularly in environments where external rewards are sparse. Here we focus on exploration with intrinsic rewards, where the agent transiently augments the external rewards with self-generated intrinsic rewards. Although the study of intrinsic rewards has a long history, existing methods focus on composing the intrinsic reward based on measures of future prospects of states, ignoring the information contained in the retrospective structure of transition sequences. Here we argue that the agent can utilise retrospective information to generate explorative behaviour with structure-awareness, facilitating efficient exploration based on global instead of local information. We propose Successor-Predecessor Intrinsic Exploration (SPIE), an exploration algorithm based on a novel intrinsic reward combining prospective and retrospective information. We show that SPIE yields more efficient and ethologically plausible exploratory behaviour in environments with sparse rewards and bottleneck states than competing methods. We also implement SPIE in deep reinforcement learning agents, and show that the resulting agent achieves stronger empirical performance than existing methods on sparse-reward Atari games.

Paper Structure

This paper contains 17 sections, 1 theorem, 23 equations, 7 figures, 5 tables, 1 algorithm.

Key Result

Proposition A.1

$\boldsymbol{\mathsf{N}}\,\text{diag}(\boldsymbol{\mathsf{z}}) = \text{diag}(\boldsymbol{\mathsf{z}})\,\boldsymbol{\mathsf{M}}$, where $\text{diag}(\boldsymbol{\mathsf{z}})$ is the diagonal matrix whose diagonal elements are the entries of the vector $\boldsymbol{\mathsf{z}}$, and ...
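The statement above is cut off before its symbols are fully defined, so the following is only a numerical sanity check under explicit assumptions, not the paper's proof. It assumes that $\boldsymbol{\mathsf{M}} = (I - \gamma P)^{-1}$ is the successor representation of a transition matrix $P$, that $\boldsymbol{\mathsf{z}}$ is the stationary distribution of $P$, and that $\boldsymbol{\mathsf{N}} = (I - \gamma \tilde{P})^{-1}$ is built from the backward transition matrix $\tilde{P}_{ij} = z_i P_{ij} / z_j$. The names `P`, `gamma`, and `P_back` are illustrative and do not come from the source.

```python
import numpy as np


def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalised to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(eigvals - 1.0))
    z = np.real(eigvecs[:, idx])
    return z / z.sum()


def successor_representation(P, gamma):
    """M = (I - gamma * P)^{-1}: discounted expected future state occupancies."""
    return np.linalg.inv(np.eye(P.shape[0]) - gamma * P)


def predecessor_representation(P, z, gamma):
    """N = (I - gamma * P_back)^{-1}, with an ASSUMED backward transition matrix.

    P_back[i, j] = Pr(s_t = i | s_{t+1} = j) = z_i * P[i, j] / z_j,
    which holds when states are distributed according to z.
    """
    P_back = (z[:, None] * P) / z[None, :]
    return np.linalg.inv(np.eye(P.shape[0]) - gamma * P_back)


# Hypothetical 5-state chain with strictly positive transition probabilities.
rng = np.random.default_rng(0)
P = rng.random((5, 5)) + 0.1
P /= P.sum(axis=1, keepdims=True)  # make each row a probability distribution
gamma = 0.95

z = stationary_distribution(P)
M = successor_representation(P, gamma)
N = predecessor_representation(P, z, gamma)

# Check the identity N diag(z) = diag(z) M numerically.
identity_holds = np.allclose(N @ np.diag(z), np.diag(z) @ M)
```

Under these assumptions the identity follows because $\tilde{P} = \text{diag}(\boldsymbol{\mathsf{z}})\, P\, \text{diag}(\boldsymbol{\mathsf{z}})^{-1}$, so $\boldsymbol{\mathsf{N}}$ is a similarity transform of $\boldsymbol{\mathsf{M}}$ by $\text{diag}(\boldsymbol{\mathsf{z}})$.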

Figures (7)

  • Figure 1: Evaluation of exploration efficiency in grid worlds. (a) Grid worlds with varying size and complexity. 'S' and 'G' in OF-small and Cluster-hard represent the start and goal states in the goal-oriented reinforcement learning task; colored $G_{1}$ and $G_{2}$ in OF-small and Cluster-hard represent the changed goal locations (see the non-stationary reward experiment in the experiments section); $s_{\ast}$ in Cluster-simple denotes the bottleneck state. (b-c) Accumulated number of states visited against exploration timesteps, for all considered agents in all grid worlds, with an online-learned SR matrix (b) and a fixed SR matrix (c). All reported results are averaged over $10$ random seeds (shaded areas denote mean $\pm$ 1 standard error). Hyperparameters can be found in the Appendix.
  • Figure 2: Graphical illustration of the neural network architecture of DQN-SF-PF for Atari games. Note that the state feature vector is L2-normalised, $\phi(s) = \frac{\tilde{\phi}(s)}{||\tilde{\phi}(s)||_{2}}$, where $\tilde{\phi}(s)$ is the raw output of the convolutional encoder.
  • Figure 3: Goal-oriented navigation in grid worlds. Evaluations of SARSA, SARSA-SR and SARSA-SRR on OF-small (a) and Cluster-hard (b) grid worlds (Figure 1) with stationary reward structure, and on OF-small (c) and Cluster-hard (d) with non-stationary reward structures. The red dashed horizontal line represents the shortest-path distance. The black dashed vertical lines mark the time points at which the goal changes occur.
  • Figure 4: Evaluation of SPIE with linear features in MountainCar. (a) Graphical demonstration of the MountainCar environment; (b) Example random Fourier features; (c) Evaluations of Q-learning with linear function approximation with intrinsic rewards $r_{\text{SF}}$ and $r_{\text{SF-PF}}$ on MountainCar. Reported results are averaged over $10$ random seeds.
  • Figure 5: Discrete MDPs. Transition probabilities are denoted by $\langle\text{action}, \text{probability}, \text{reward}\rangle$. In RiverSwim (a), the agent starts in state 1 or 2. In SixArms (b), the agent starts in state 0.
  • ...and 2 more figures
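
The L2-normalisation of state features noted in the Figure 2 caption, $\phi(s) = \tilde{\phi}(s) / ||\tilde{\phi}(s)||_{2}$, can be sketched as follows. This is a minimal NumPy illustration, not the DQN-SF-PF implementation: the function name `l2_normalise`, the small constant `eps` (added here to guard against division by zero), and the example feature values are all assumptions.

```python
import numpy as np


def l2_normalise(phi_raw, eps=1e-8):
    """Scale each feature vector (last axis) to unit L2 norm.

    `eps` is an assumed safeguard against zero-norm inputs; the caption
    does not mention one.
    """
    norm = np.linalg.norm(phi_raw, ord=2, axis=-1, keepdims=True)
    return phi_raw / np.maximum(norm, eps)


# Hypothetical raw encoder outputs for a batch of two states.
phi_raw = np.array([[3.0, 4.0],
                    [0.5, 0.0]])
phi = l2_normalise(phi_raw)
```

After normalisation every row of `phi` has unit length, so the features differ only in direction and not in scale.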

Theorems & Definitions (2)

  • Proposition A.1
  • proof