
Maximum Entropy Exploration Without the Rollouts

Jacob Adamczyk, Adam Kamoski, Rahul V. Kulkarni

Abstract

Efficient exploration remains a central challenge in reinforcement learning, serving as a useful pretraining objective for data collection, particularly when an external reward function is unavailable. A principled formulation of the exploration problem is to find policies that maximize the entropy of their induced steady-state visitation distribution, thereby encouraging uniform long-run coverage of the state space. Many existing exploration approaches require estimating state visitation frequencies through repeated on-policy rollouts, which can be computationally expensive. In this work, we instead consider an intrinsic average-reward formulation in which the reward is derived from the visitation distribution itself, so that the optimal policy maximizes steady-state entropy. An entropy-regularized version of this objective admits a spectral characterization: the relevant stationary distributions can be computed from the dominant eigenvectors of a problem-dependent transition matrix. This insight leads to a novel algorithm for solving the maximum entropy exploration problem, EVE (EigenVector-based Exploration), which avoids explicit rollouts and distribution estimation, instead computing the solution through iterative updates, similar to a value-based approach. To address the original unregularized objective, we employ a posterior-policy iteration (PPI) approach, which monotonically improves the entropy and converges in value. We prove convergence of EVE under standard assumptions and demonstrate empirically that it efficiently produces policies with high steady-state entropy, achieving competitive exploration performance relative to rollout-based baselines in deterministic grid-world environments.
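
As a small illustration of the objective described above, the following sketch computes the stationary distribution of a fixed Markov chain by power iteration and evaluates its entropy. The two 3-state transition matrices are hypothetical examples, not taken from the paper, and the code is not the EVE algorithm; it only shows what "steady-state entropy" measures once a policy (and hence a chain) has been fixed.

```python
import math

def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Left dominant eigenvector of a row-stochastic matrix P via power iteration."""
    n = len(P)
    d = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # One step of d <- d P, followed by renormalization.
        new = [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]
        total = sum(new)
        new = [x / total for x in new]
        if max(abs(a - b) for a, b in zip(new, d)) < tol:
            return new
        d = new
    return d

def entropy(d):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in d if p > 0)

# Hypothetical chain 1: doubly stochastic, so its stationary distribution is uniform.
P_uniform = [[0.5, 0.5, 0.0],
             [0.5, 0.0, 0.5],
             [0.0, 0.5, 0.5]]

# Hypothetical chain 2: lingers in state 0, so its stationary distribution is skewed.
P_biased = [[0.9, 0.1, 0.0],
            [0.5, 0.0, 0.5],
            [0.0, 0.5, 0.5]]

d_uniform = stationary_distribution(P_uniform)
d_biased = stationary_distribution(P_biased)
H_uniform = entropy(d_uniform)
H_biased = entropy(d_biased)
```

Under these assumptions the first chain attains the maximum possible entropy for three states, $\log 3$, while the second concentrates mass on state 0 and scores lower. A maximum entropy exploration method would prefer a policy inducing the first kind of chain.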

Paper Structure

This paper contains 18 sections, 3 theorems, 29 equations, 1 figure, 1 algorithm.

Key Result

Theorem 1

Let the dynamics $p(s'|s,a)$ be irreducible and aperiodic, and denote by $m$ the index of primitivity for the Markov chain over state-actions induced by $\pi_0$. The mapping $u\leftarrow \mathcal{T}(u)$ given by Equation eq:u-update is a contraction under the projective metric, and converges linearly ...
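
To give some intuition for contraction in a projective metric, here is a minimal sketch using Hilbert's projective metric and a normalized matrix-vector update. The operator $\mathcal{T}$ from the paper's update equation is not reproduced in this excerpt, so the strictly positive matrix `A` below is a hypothetical stand-in, and the function names are illustrative only. For a strictly positive matrix, the Birkhoff-Hopf theorem guarantees that the linear map strictly contracts Hilbert's metric, so two arbitrary positive starting vectors are pulled toward the same ray.

```python
import math

def hilbert_metric(x, y):
    """Hilbert projective metric between two strictly positive vectors."""
    ratios = [a / b for a, b in zip(x, y)]
    return math.log(max(ratios) / min(ratios))

def normalized_step(A, u):
    """Apply u <- A u and rescale so the entries sum to one."""
    v = [sum(a * x for a, x in zip(row, u)) for row in A]
    s = sum(v)
    return [x / s for x in v]

# Hypothetical strictly positive matrix (an assumption, not the paper's operator).
A = [[2.0, 1.0, 0.5],
     [1.0, 3.0, 1.0],
     [0.5, 1.0, 2.0]]

# Two arbitrary strictly positive starting vectors.
u = [1.0, 2.0, 3.0]
w = [3.0, 1.0, 1.0]

distances = []
for _ in range(10):
    distances.append(hilbert_metric(u, w))
    u, w = normalized_step(A, u), normalized_step(A, w)

# The metric ignores positive rescaling, which is why normalization is harmless.
scale_invariant = hilbert_metric([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The recorded `distances` shrink at every step, which is the geometric (linear) convergence behavior one expects from a contraction. When a nonnegative matrix is only primitive rather than strictly positive, the same argument is typically applied to a power of the matrix that has all positive entries, which is presumably where an index of primitivity enters a statement like the one above.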

Figures (1)

  • Figure 1: EVE converges to an exploration policy that achieves maximum entropy. Compared to the baselines, the optimal policy found by EVE produces a higher entropy and converges much faster. (Inset) "CliffWorld" environment used. The green circle denotes the initial state; stepping into the cliff resets the agent. Each line represents the mean over 5 independent initializations and the shaded region denotes one standard deviation.

Theorems & Definitions (5)

  • Theorem 1: Convergence of EVE
  • Lemma 1
  • Theorem 1
  • proof
  • proof