Learning to Explore in Diverse Reward Settings via Temporal-Difference-Error Maximization

Sebastian Griesbach, Carlo D'Eramo

TL;DR

The paper tackles robust exploration across diverse reward settings by decoupling exploration from exploitation and maximizing the absolute TD-error. It proposes Stable Error-seeking Exploration (SEE), which combines three design choices: (i) mixing the exploitation and exploration policies into a single behavior policy using a Boltzmann mixture over relative advantages, (ii) adopting a maximum reward update, and (iii) conditioning the exploration value function on the exploitation value function via fingerprinting to handle non-stationarity. SEE is instantiated with SAC and TD3 and evaluated across dense, sparse, and exploration-adverse rewards, demonstrating strong robustness without hyperparameter tuning and clear gains in adverse settings. The approach broadens the practical applicability of exploration strategies in off-policy RL, with ablations confirming the contribution of each design choice.
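The Boltzmann mixture in (i) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each policy's relative advantage has already been computed as a scalar, and the function names and the `temperature` parameter are hypothetical.

```python
import numpy as np

def boltzmann_mixture_probs(advantages, temperature=1.0):
    """Softmax over per-policy advantages (assumed form of the mixture)."""
    adv = np.asarray(advantages, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability.
    logits = (adv - adv.max()) / temperature
    weights = np.exp(logits)
    return weights / weights.sum()

def select_policy(advantages, rng, temperature=1.0):
    """Sample the index of the policy that acts at this step."""
    probs = boltzmann_mixture_probs(advantages, temperature)
    return int(rng.choice(len(probs), p=probs))

# Hypothetical usage: index 0 = exploitation policy, index 1 = exploration policy.
rng = np.random.default_rng(0)
acting_policy = select_policy([0.3, 0.1], rng, temperature=0.5)
```

Under this reading, a lower temperature makes the behavior policy follow whichever policy currently has the larger relative advantage more often, while a higher temperature moves the mixture toward a uniform choice.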

Abstract

Numerous heuristics and advanced approaches have been proposed for exploration in different settings for deep reinforcement learning. Noise-based exploration generally fares well with dense-shaped rewards and bonus-based exploration with sparse rewards. However, these methods usually require additional tuning to deal with undesirable reward settings by adjusting hyperparameters and noise distributions. Rewards that actively discourage exploration, i.e., with an action cost and no other dense signal to follow, can pose a major challenge. We propose a novel exploration method, Stable Error-seeking Exploration (SEE), that is robust across dense, sparse, and exploration-adverse reward settings. To this end, we revisit the idea of maximizing the TD-error as a separate objective. Our method introduces three design choices to mitigate instability caused by far-off-policy learning, the conflict of interest of maximizing the cumulative TD-error in an episodic setting, and the non-stationary nature of TD-errors. SEE can be combined with off-policy algorithms without modifying the optimization pipeline of the original objective. In our experimental analysis, we show that a Soft Actor-Critic agent with the addition of SEE performs robustly across three diverse reward settings in a variety of tasks without hyperparameter adjustments.
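To make the "TD-error as a separate objective" idea concrete, the sketch below computes an absolute one-step TD-error and a maximum-style bootstrap target. Both forms are assumptions for illustration: the abstract does not give the exact equations, and the function names, the `done` flag convention, and the default `gamma` are hypothetical.

```python
import numpy as np

def abs_td_error(reward, q_sa, q_next, done, gamma=0.99):
    """Absolute one-step TD-error of an exploitation critic (assumed form)."""
    target = reward + gamma * (1.0 - done) * q_next
    return np.abs(target - q_sa)

def max_reward_target(explore_reward, q_explore_next, done, gamma=0.99):
    """Assumed maximum-style target: keep the larger of the immediate
    exploration reward and the discounted next value instead of their sum."""
    return np.maximum(explore_reward, gamma * (1.0 - done) * q_explore_next)

# Hypothetical transition: the TD-error serves as the exploration reward.
r_explore = abs_td_error(reward=1.0, q_sa=0.5, q_next=2.0, done=0.0, gamma=0.9)
target = max_reward_target(r_explore, q_explore_next=1.0, done=0.0, gamma=0.9)
```

A target that takes a maximum rather than a sum does not grow with episode length, which is one plausible reading of how the conflict around cumulative TD-error in episodic tasks is addressed.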

Paper Structure

This paper contains 21 sections, 6 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Example of a mixed behavior policy of two deterministic policies $\pi_1, \pi_2$ going towards their respective goals. An action moves a maximum distance of $0.05$ on a plane of the size of $2 \times 2$. The respective action-value functions assume a reward of $1$ and termination at their target position, $0$ elsewhere. We use a discount factor of $0.9$.
  • Figure 2: Fingerprinting $\phi$ is used to condition a value function $Q^\omega$ on another value function $Q^\theta$.
  • Figure 3: Comparing SAC+SEE and TD3+SEE to their respective base algorithms across multiple environments in different reward settings. The plots show the average evaluation return across $20$ seeds per environment variant. The shaded regions indicate the standard error.
  • Figure 4: Comparing SAC+SEE to ablations where one of the design choices is replaced. The graphs show the average evaluation return across $10$ seeds per environment variant. The shaded regions indicate the standard error.
  • Figure 5: The graphs show the normalized aggregated average returns grouped by reward setting for all ablations. The normalization assigns the value $0$ to the single worst run in an environment variant and $1$ to the single best run.
  • ...and 3 more figures
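
The statistics named in the captions of Figures 3 to 5 can be sketched as follows. This is an illustration under assumptions rather than the authors' evaluation code: the standard error is taken as the sample standard deviation (`ddof=1`) divided by the square root of the number of seeds, and the helper names are hypothetical.

```python
import numpy as np

def normalize_returns(returns):
    """Min-max normalization: worst run maps to 0, best run maps to 1."""
    r = np.asarray(returns, dtype=float)
    lo, hi = r.min(), r.max()
    if hi == lo:
        # All runs identical; return zeros to avoid dividing by zero.
        return np.zeros_like(r)
    return (r - lo) / (hi - lo)

def standard_error(values):
    """Standard error of the mean across seeds (sample std, ddof=1 assumed)."""
    v = np.asarray(values, dtype=float)
    return v.std(ddof=1) / np.sqrt(v.size)

# Hypothetical returns of three runs in one environment variant.
normalized = normalize_returns([10.0, 20.0, 30.0])
```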