The Exploration-Exploitation Dilemma Revisited: An Entropy Perspective

Renye Yan; Yaozhong Gan; You Wu; Ling Liang; Junliang Xing; Yimao Cai; Ru Huang

TL;DR

The paper tackles sparse rewards and the exploration-exploitation imbalance in reinforcement learning by adopting an entropy-based perspective. It introduces AdaZero, an end-to-end framework that uses a state autoencoder to generate intrinsic rewards and a mastery network to adaptively balance exploration and exploitation via a self-adaptive Bellman equation $Q_{total}(s,a)=\mathbb{E}_{\tau}\left[\sum_{i} \gamma^{i} \left(R_{ext}(s_i,a_i)+(1-\alpha(s_i))R_{int}(s_i,a_i)\right) \mid s,a\right]$, in which the balance factor $\alpha(s_i)$ scales down the intrinsic reward as it approaches 1. Empirical results across 63 Atari and MuJoCo tasks show substantial improvements, including final returns up to $15\times$ higher on Montezuma's Revenge and robust generalization to other domains, all without environment-specific tuning. Visualization analyses validate the entropy-guided adaptive mechanism and illustrate how entropy and mastery evolve with training. Overall, the work demonstrates that end-to-end, entropy-aware adaptation can significantly improve sample efficiency and policy quality in diverse RL settings.
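To make the weighting inside the self-adaptive Bellman equation concrete, the following is a minimal sketch that evaluates the bracketed sum for a single sampled trajectory, i.e. one Monte Carlo sample of the expectation rather than $Q_{total}$ itself. The function name `adaptive_return`, its argument names, and the default discount of 0.99 are illustrative assumptions, not taken from the paper; the per-step values of $R_{ext}$, $R_{int}$, and $\alpha$ are assumed to be given.

```python
from typing import Sequence


def adaptive_return(
    r_ext: Sequence[float],
    r_int: Sequence[float],
    alpha: Sequence[float],
    gamma: float = 0.99,  # assumed discount factor, not a value from the paper
) -> float:
    """Discounted sum of r_ext[i] + (1 - alpha[i]) * r_int[i] over one trajectory.

    This is a single-trajectory sample of the quantity inside the expectation;
    the name and signature are hypothetical.
    """
    if not (len(r_ext) == len(r_int) == len(alpha)):
        raise ValueError("r_ext, r_int and alpha must have the same length")

    total = 0.0
    for i, (ext, intr, a) in enumerate(zip(r_ext, r_int, alpha)):
        # alpha close to 1 suppresses the intrinsic (exploration) reward;
        # alpha close to 0 lets it through at full strength.
        total += (gamma ** i) * (ext + (1.0 - a) * intr)
    return total
```

With `alpha` fixed at 1 for every step the sum reduces to the ordinary discounted extrinsic return, and with `alpha` fixed at 0 the intrinsic reward is added at full weight, which is the sense in which the balance factor interpolates between exploitation and exploration.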

Abstract

The imbalance of exploration and exploitation has long been a significant challenge in reinforcement learning. In policy optimization, excessive reliance on exploration reduces learning efficiency, while over-dependence on exploitation might trap agents in local optima. This paper revisits the exploration-exploitation dilemma from the perspective of entropy by revealing the relationship between entropy and the dynamic adaptive process of exploration and exploitation. Based on this theoretical insight, we establish an end-to-end adaptive framework called AdaZero, which automatically determines whether to explore or to exploit, as well as how strongly to weight each. Experiments show that AdaZero significantly outperforms baseline models across various Atari and MuJoCo environments with only a single setting. In particular, in the challenging Montezuma's Revenge environment, AdaZero boosts the final returns by up to fifteen times. Moreover, we conduct a series of visualization analyses to reveal the dynamics of our self-adaptive mechanism, demonstrating how entropy reflects and changes with the agent's performance and adaptive process.

Paper Structure (23 sections, 16 equations, 13 figures, 1 algorithm)

Introduction
Preliminaries
Markov Decision Process (MDP).
Intrinsic-Based RL.
Method
Exploration-Exploitation Adaptation from the Perspective of Entropy
Self-Adaptive Bellman Equation.
AdaZero’s Framework
State Autoencoder.
Mastery Evaluation Network.
Adaptive Mechanisms.
Evaluation Experiments
Main Experiments
Generalization Experiments
Visualization Analysis
...and 8 more sections

Figures (13)

Figure 1: AdaZero's Framework. AdaZero consists of three main components: (A) State Autoencoder, (B) Evaluation Network for level of mastery, and (C) Adaptive Mechanism. The state autoencoder encodes and reconstructs states in raw images, where the reconstruction errors work as the driving force for the agent's exploration. The mastery evaluation network evaluates the reconstructed states and outputs the probability of $\hat{s}$ being real images as the balance factor $\alpha(\hat{s})$. Finally, $\alpha(\hat{s})$ is used in the adaptive mechanism to dynamically balance exploration and exploitation.
Figure 2: Main experiments in the most challenging Montezuma's Revenge. (a) We integrate AdaZero into three representative RL methods and show that AdaZero can bring significant improvements; (b) Our method also presents an advantageous performance compared with other advanced baselines. Equipped with AdaZero, RND reached the score published in the original paper within only one-tenth of the training steps.
Figure 3: Generalization experiments in discrete-space environments (Atari). The x-axis represents timesteps in units of 10 million.
Figure 4: Generalization experiments in MuJoCo, showing the generalizable performance of AdaZero in different types of continuous-space tasks. The x-axis represents timesteps in millions.
Figure 5: Entropy Visualization of AdaZero vs. PPO and RND in Atari. The x-axis represents timesteps in millions.
...and 8 more figures

Theorems & Definitions (2)

proof
proof
