Improving Intrinsic Exploration by Creating Stationary Objectives

Roger Creus Castanyer, Joshua Romoff, Glen Berseth

TL;DR

This paper tackles the non-stationarity inherent in many intrinsic exploration rewards by showing that augmenting the MDP state with the rewards' sufficient statistics $\phi_t$ converts these objectives into stationary ones. The proposed Stationary Objectives For Exploration (SOFE) framework covers count-based, pseudo-count, and state-entropy maximization rewards under a single stationary formulation, enabling end-to-end training with a single policy. Empirical results across reward-free, sparse-reward, and high-dimensional tasks (including 3D navigation, procedurally generated environments, and pixel-based observations) show that SOFE consistently improves exploration and often outperforms prior stabilization approaches such as DeRL. The approach scales to large state-action spaces and is compatible with multiple RL algorithms, which suggests it is broadly applicable to exploration in complex environments.
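
To make the stationarity argument concrete, consider a count-based bonus. The sketch below assumes the common $\beta/\sqrt{N}$ form; the coefficient $\beta$, the count notation $N_t$, and the augmented state $\tilde{s}_t$ are illustrative choices rather than notation taken from the summary above. Let $N_t(s) = \sum_{k=0}^{t} \mathbb{1}[s_k = s]$ be the number of visits to $s$ up to time $t$. As a function of the original state alone, the bonus for arriving at $s_{t+1}$ depends on $t$:

$$
r_t(s_{t+1}) = \frac{\beta}{\sqrt{N_{t+1}(s_{t+1})}}, \qquad N_{t+1}(s) = N_t(s) + \mathbb{1}[s_{t+1} = s].
$$

If the counts are taken as the sufficient statistics, $\phi_t = N_t$, and the agent observes $\tilde{s}_t = (s_t, \phi_t)$, the same bonus can be written as one fixed function $R$ of the augmented state:

$$
R(\tilde{s}_{t+1}) = \frac{\beta}{\sqrt{\phi_{t+1}(s_{t+1})}}, \qquad \phi_{t+1} = \phi_t + e_{s_{t+1}},
$$

where $e_{s}$ is the indicator vector of state $s$. Under these assumptions neither $R$ nor the update of $\phi_t$ depends on $t$ explicitly, so the augmented process has stationary rewards and Markov transitions.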

Abstract

Exploration bonuses in reinforcement learning guide long-horizon exploration by defining custom intrinsic objectives. Several exploration objectives like count-based bonuses, pseudo-counts, and state-entropy maximization are non-stationary and hence are difficult to optimize for the agent. While this issue is generally known, it is usually omitted and solutions remain under-explored. The key contribution of our work lies in transforming the original non-stationary rewards into stationary rewards through an augmented state representation. For this purpose, we introduce the Stationary Objectives For Exploration (SOFE) framework. SOFE requires identifying sufficient statistics for different exploration bonuses and finding an efficient encoding of these statistics to use as input to a deep network. SOFE is based on proposing state augmentations that expand the state space but hold the promise of simplifying the optimization of the agent's objective. We show that SOFE improves the performance of several exploration objectives, including count-based bonuses, pseudo-counts, and state-entropy maximization. Moreover, SOFE outperforms prior methods that attempt to stabilize the optimization of intrinsic objectives. We demonstrate the efficacy of SOFE in hard-exploration problems, including sparse-reward tasks, pixel-based observations, 3D navigation, and procedurally generated environments.
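
The state augmentation described in the abstract can be sketched for the count-based case as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ChainEnv`, `CountAugmentedEnv`, `intrinsic_reward`, and the coefficient `beta` are hypothetical names, the bonus is assumed to take the common `beta / sqrt(N)` form, and the counts are assumed to persist across resets.

```python
import math


class ChainEnv:
    """Toy deterministic chain with states 0..n_states-1 (hypothetical test bed)."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position
        self.state = min(max(self.state + action, 0), self.n_states - 1)
        return self.state


class CountAugmentedEnv:
    """Adds a count-based bonus and exposes the counts in the observation.

    Assumption: the visit counts are the sufficient statistics of the bonus,
    so the augmented observation is the pair (state, counts).
    """

    def __init__(self, env, beta=1.0):
        self.env = env
        self.beta = beta
        self.counts = [0] * env.n_states  # kept across resets (assumption)

    def _augment(self, state):
        # Return an immutable snapshot so stored observations do not change later
        return (state, tuple(self.counts))

    def reset(self):
        state = self.env.reset()
        self.counts[state] += 1
        return self._augment(state)

    def step(self, action):
        next_state = self.env.step(action)
        self.counts[next_state] += 1
        bonus = self.beta / math.sqrt(self.counts[next_state])
        return self._augment(next_state), bonus


def intrinsic_reward(augmented_obs, beta=1.0):
    """The same bonus, computed from the augmented observation alone."""
    state, counts = augmented_obs
    return beta / math.sqrt(counts[state])


if __name__ == "__main__":
    env = CountAugmentedEnv(ChainEnv(n_states=3))
    obs = env.reset()
    for action in (1, -1, 1, 1):
        obs, bonus = env.step(action)
        # The wrapper's bonus matches the time-independent function of obs
        print(obs, round(bonus, 3), round(intrinsic_reward(obs), 3))
```

In this sketch the bonus returned by `step` always equals `intrinsic_reward(obs)`, a fixed function of the augmented observation, whereas it is not a fixed function of the raw state. For large or continuous state spaces a raw count vector would not be practical; the abstract notes that SOFE needs an efficient encoding of the statistics, which this toy example does not attempt.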

Paper Structure (35 sections, 4 equations, 40 figures, 4 tables, 1 algorithm)

Figures (40)

  • Figure 1: SOFE enables agents to observe the sufficient statistics of the intrinsic rewards and use them for decision-making.
  • Figure 2: We use 3 mazes and a large 3D map to evaluate both goal-reaching and purely exploratory behaviors. Maze 1: a fully connected, hard-exploration maze; Maze 2: a maze with open spaces and a goal; Maze 3: same as Maze 1 but with 3 doors which an intelligent agent should use for more efficient exploration; 3D map: a large map with continuous state and action spaces.
  • Figure 3: Episodic state-visitation for A2C agents during training. The first row represents SOFE, which uses both the count-based rewards and state augmentation (+ C. + Aug.), and the second row represents training with the count-based rewards only (+ C.). Although optimizing for the same reward distribution, our method achieves better exploration performance.
  • Figure 4: Map coverage achieved by SAC agents in a complex 3D map. Blue curves represent agents that use count-based rewards (+ C.); Red curves represent SOFE, which uses both count-based rewards and the state augmentations from SOFE (+ C. + Aug.). Even though we use the same learning objective, SOFE facilitates its optimization and achieves better exploration. Shaded areas represent one standard deviation. Results are averaged from 6 seeds.
  • Figure 5: Episodic state coverage achieved by S-Max (blue) and SOFE S-Max (red) in Maze 2. When augmented with SOFE, agents better optimize for the state-entropy maximization objective. Shaded areas represent one standard deviation. Results are averaged from 6 seeds.
  • ...and 35 more figures