VDSC: Enhancing Exploration Timing with Value Discrepancy and State Counts

Marius Captari; Remo Sasso; Matthia Sabatelli

TL;DR

This paper tackles the question of when to explore in deep reinforcement learning by leveraging internal signals rather than blind switching. It introduces VDSC, a unified trigger system that combines Value Promise Discrepancy (VPD) with SimHash-based state counts through a homeostasis mechanism to decide exploration timing. Across Atari experiments with the Rainbow agent, VDSC outperforms traditional methods like $\epsilon$-greedy, Boltzmann, and Noisy Nets, especially in hard-exploration games, with ablations confirming the benefit of combining signals. The approach enhances data efficiency by directly using internal state information to govern exploration without distorting rewards, and opens pathways for extending the idea to policy-gradient methods and learned hashing.

Abstract

Despite the considerable attention given to the questions of \textit{how much} and \textit{how to} explore in deep reinforcement learning, the question of \textit{when} to explore remains comparatively under-researched. While more sophisticated exploration strategies can excel in specific, often sparse-reward environments, simpler approaches such as $\epsilon$-greedy continue to outperform them across a broader spectrum of domains. The appeal of these simpler strategies lies in their ease of implementation and their generality. The downside is that they are essentially blind switching mechanisms that completely disregard the agent's internal state. In this paper, we propose to leverage the agent's internal state to decide \textit{when} to explore, addressing the shortcomings of blind switching mechanisms. We present Value Discrepancy and State Counts through homeostasis (VDSC), a novel approach for efficient exploration timing. Experimental results on the Atari suite demonstrate the superiority of our strategy over traditional methods such as $\epsilon$-greedy and Boltzmann, as well as more sophisticated techniques like Noisy Nets.

Paper Structure (27 sections, 4 equations, 4 figures, 2 algorithms)

Introduction
Contribution.
Paper Structure.
Preliminaries
Related Work
Intrinsic reward
Count-based
Probabilistic
Uncertainty
Goal-based
Dithering
Methods
Value Promise Discrepancy
Count-Based Exploration and SimHash
Homeostasis
...and 12 more sections

Figures (4)

Figure 1: Possible state hash conversion using $\kappa=64$ bits on an 8$\times$8 grid. Transition from a pre-processed Atari Pong game state (left) to the corresponding hashed state (right).
Figure 2: Average episode returns comparing VDSC against baseline methods in addition to each individual trigger in isolation. Shaded regions represent 95% confidence intervals over 3 random seeds.
Figure 3: Top: Detailed overview of exploration timings over 20 consecutive training episodes. White vertical bars represent steps in which the agent chose to explore. Bottom: Corresponding average trigger values tracked over the same 20 episodes.
Figure 4: Average episode returns for all Atari games. Shaded regions represent 95% confidence intervals over 3 random seeds.
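
The state hashing pictured in Figure 1 can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes the standard SimHash construction (the sign pattern of a fixed Gaussian random projection), which this page names but does not spell out. The names `make_simhash` and `HashCounter` and the flattened-state input are hypothetical. How the resulting counts feed the VDSC trigger through homeostasis is not shown here.

```python
import random
from collections import Counter


def make_simhash(dim, kappa=64, seed=0):
    """Build a hash function mapping a length-`dim` vector to `kappa` bits.

    Assumption: standard SimHash, i.e. the sign of a fixed random projection.
    """
    rng = random.Random(seed)
    # Fixed (kappa x dim) projection matrix with standard-normal entries.
    proj = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(kappa)]

    def simhash(state):
        # One bit per projection row: 1 if the dot product is positive, else 0.
        return tuple(
            int(sum(w * x for w, x in zip(row, state)) > 0) for row in proj
        )

    return simhash


class HashCounter:
    """Visit counts keyed by the hashed state (hypothetical helper)."""

    def __init__(self, dim, kappa=64, seed=0):
        self.hash = make_simhash(dim, kappa, seed)
        self.counts = Counter()

    def visit(self, state):
        # Increment and return the count of the bucket this state falls into.
        key = self.hash(state)
        self.counts[key] += 1
        return self.counts[key]
```

With $\kappa=64$, each state maps to a 64-bit code, which can be displayed as an 8$\times$8 grid as in Figure 1. States whose projections share the same sign pattern land in the same bucket, so the count reflects how often similar states have been visited, not only exact repeats.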
