
Maximum-Entropy Exploration with Future State-Action Visitation Measures

Adrien Bolland, Gaspard Lambrechts, Damien Ernst

Abstract

Maximum entropy reinforcement learning motivates agents to explore states and actions to maximize the entropy of some distribution, typically by providing additional intrinsic rewards proportional to that entropy function. In this paper, we study intrinsic rewards proportional to the entropy of the discounted distribution of state-action features visited during future time steps. This approach is motivated by two results. First, we show that the expected sum of these intrinsic rewards is a lower bound on the entropy of the discounted distribution of state-action features visited in trajectories starting from the initial states, which we relate to an alternative maximum entropy objective. Second, we show that the distribution used in the intrinsic reward definition is the fixed point of a contraction operator and can therefore be estimated off-policy. Experiments highlight that the new objective leads to improved visitation of features within individual trajectories, in exchange for slightly reduced visitation of features in expectation over different trajectories, as suggested by the lower bound. It also leads to improved convergence speed for learning exploration-only agents. Control performance remains similar across most methods on the considered benchmarks.
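The fixed-point claim in the abstract can be illustrated on a small tabular problem. The sketch below is not the authors' algorithm. It assumes the features are the state-action pairs themselves, and it assumes the conditional measure is defined as $d(\bar{s}, \bar{a} | s, a) = (1 - \gamma) \sum_{t \geq 0} \gamma^t \Pr(s_{t+1} = \bar{s}, a_{t+1} = \bar{a} | s_0 = s, a_0 = a)$, which may differ from the paper's convention. All function names (`random_mdp`, `state_action_kernel`, `visitation_operator`, and so on) are hypothetical.

```python
import numpy as np


def random_mdp(n_states, n_actions, rng):
    """Hypothetical random MDP: kernel P[s, a, s'] and policy pi[s, a]."""
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    pi = rng.dirichlet(np.ones(n_actions), size=n_states)
    return P, pi


def state_action_kernel(P, pi):
    """K[(s, a), (s', a')] = P(s' | s, a) * pi(a' | s')."""
    n_s, n_a, _ = P.shape
    K = P[:, :, :, None] * pi[None, None, :, :]  # shape (s, a, s', a')
    return K.reshape(n_s * n_a, n_s * n_a)


def visitation_operator(D, K, gamma):
    """One application of the assumed operator T D = (1 - gamma) K + gamma K D."""
    return (1.0 - gamma) * K + gamma * K @ D


def conditional_visitation(K, gamma, n_iters=500):
    """Iterate the operator from a uniform guess; the error shrinks by gamma per step."""
    D = np.full_like(K, 1.0 / K.shape[1])
    for _ in range(n_iters):
        D = visitation_operator(D, K, gamma)
    return D


def closed_form(K, gamma):
    """Fixed point solved directly: D = (1 - gamma) (I - gamma K)^{-1} K."""
    n = K.shape[0]
    return (1.0 - gamma) * np.linalg.solve(np.eye(n) - gamma * K, K)


def intrinsic_reward(D):
    """Entropy of each row d(. | s, a), used here as an illustrative intrinsic reward."""
    p = np.clip(D, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)


rng = np.random.default_rng(0)
P, pi = random_mdp(n_states=4, n_actions=2, rng=rng)
K = state_action_kernel(P, pi)
D = conditional_visitation(K, gamma=0.9)
print("max gap to closed form:", np.abs(D - closed_form(K, 0.9)).max())
print("intrinsic rewards:", intrinsic_reward(D))
```

In this sketch the operator only involves one-step transitions $(s, a, s')$ and next actions drawn from $\pi$, which is the sense in which such a fixed point lends itself to off-policy estimation from a replay buffer.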

Paper Structure

This paper contains 31 sections, 4 theorems, 23 equations, 4 figures, 1 algorithm.

Key Result

Theorem 3.2

For any policy $\pi$ and any relative measure $q^*$, the marginal and conditional visitation measures satisfy [displayed relation missing from this extract], where $L$ is a finite constant and $\tilde{d}^{\pi, \gamma}(\bar{s}, \bar{a}) = \mathbb{E}_{s, a \sim d^{\pi, \gamma}(\cdot, \cdot)} [ d^{\pi, \gamma}(\bar{s}, \bar{a}| s, a)]$.
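The mixture $\tilde{d}^{\pi, \gamma}$ defined above can be computed exactly in a tabular setting. The sketch below does not restate Theorem 3.2, whose displayed relation is not available in this text. It only checks a standard consequence of the concavity of entropy: the entropy of the mixture $\tilde{d}^{\pi, \gamma}$ is at least the average entropy of the conditional measures under $d^{\pi, \gamma}$. The random MDP, the variable names, and the convention that the marginal counts visits from $t = 0$ while the conditional counts visits from $t = 1$ are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 3, 0.9
n = n_s * n_a

# Hypothetical random MDP, policy, and initial state distribution.
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s']
pi = rng.dirichlet(np.ones(n_a), size=n_s)         # pi[s, a]
p0 = rng.dirichlet(np.ones(n_s))                   # p0[s]

# State-action transition kernel and initial state-action distribution.
K = (P[:, :, :, None] * pi[None, None, :, :]).reshape(n, n)
mu0 = (p0[:, None] * pi).reshape(n)

# (1 - gamma) * sum_t gamma^t K^t, computed in closed form.
resolvent = (1.0 - gamma) * np.linalg.inv(np.eye(n) - gamma * K)

d_marg = mu0 @ resolvent      # assumed marginal measure d(s, a)
D_cond = resolvent @ K        # assumed conditional measure d(. | s, a), one row per (s, a)
d_tilde = d_marg @ D_cond     # mixture E_{s, a ~ d}[ d(. | s, a) ]


def entropy(p):
    """Shannon entropy along the last axis, with clipping to avoid log(0)."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)


h_tilde = entropy(d_tilde)            # entropy of the mixture
h_cond = d_marg @ entropy(D_cond)     # expected entropy of the conditionals
print("H(d_tilde) =", h_tilde)
print("E_d[H(d(.|s,a))] =", h_cond)
```

The second printed quantity plays the role of an expected intrinsic reward in this toy setting, and it never exceeds the first, which matches the lower-bound reading given in the abstract.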

Figures (4)

  • Figure 1: Learning results for two representative environments, selected from Appendix \ref{apx:learning}. The first column represents the evolution of $- KL_z(d^{\pi, \gamma}(z) || q^*(z) )$, and the second column the evolution of $- \mathbb{E}_{s_0 \sim p_0(\cdot)} \left[ KL_z(d^{\pi, \gamma}(z|s_0) || q^*(z) ) \right]$, when learning exploration policies. The third column represents the evolution of the expected return when learning MaxEntRL control policies.
  • Figure 2: Evolution of the entropy of the discounted visitation probability measure of the position of the agent on the grid when computing exploration policies (i.e., when neglecting the rewards of the MDP). The entropy is computed empirically with Monte Carlo simulations. For each iteration, the interquartile mean over 6 runs is reported, along with its $95\%$ confidence interval.
  • Figure 3: Evolution of the conditional entropy of the discounted visitation probability measure of the position of the agent on the grid when computing exploration policies (i.e., when neglecting the rewards of the MDP). The entropy is computed empirically with Monte Carlo simulations. For each iteration, the interquartile mean over 6 runs is reported, along with its $95\%$ confidence interval.
  • Figure 4: Expected return during the policy optimization. The expectation is computed empirically with Monte Carlo simulations. For each iteration, the interquartile mean over 6 runs is reported, along with its $95\%$ confidence interval.
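
The captions above mention that entropies are computed empirically with Monte Carlo simulations. A minimal sketch of such an estimate is given below for the position of an agent on a grid. The grid dynamics, the start cell, the truncation horizon, and the function names are assumptions made for illustration and are not taken from the paper's environments.

```python
import numpy as np

MOVES = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]])  # up, down, left, right


def discounted_visitation_entropy(policy, size=5, gamma=0.95,
                                  n_episodes=200, horizon=100, seed=0):
    """Monte Carlo estimate of the entropy of the discounted position visitation.

    Each visit at time t is weighted by gamma**t; the weights are normalised at
    the end, which also absorbs the (1 - gamma) factor and the truncation.
    """
    rng = np.random.default_rng(seed)
    weights = np.zeros((size, size))
    for _ in range(n_episodes):
        pos = np.array([0, 0])  # assumed fixed start cell
        for t in range(horizon):
            weights[pos[0], pos[1]] += gamma ** t
            action = policy(pos, rng)
            pos = np.clip(pos + MOVES[action], 0, size - 1)  # walls block moves
    d_hat = weights / weights.sum()
    nonzero = d_hat[d_hat > 0]
    return float(-(nonzero * np.log(nonzero)).sum()), d_hat


def uniform_policy(pos, rng):
    """Hypothetical exploration baseline: pick one of the four moves uniformly."""
    return rng.integers(4)


h_uniform, d_uniform = discounted_visitation_entropy(uniform_policy)
print("estimated entropy under the uniform policy:", h_uniform)
```

An estimate of this kind would be computed once per run and per iteration; aggregating it over runs, for example with an interquartile mean, is a separate step that this sketch does not cover.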

Theorems & Definitions (7)

  • Definition 3.1
  • Theorem 3.2
  • Definition 4.1
  • Theorem 4.2
  • Definition 4.3
  • Theorem 4.4
  • Theorem 4.5