Model-Free Active Exploration in Reinforcement Learning

Alessio Russo; Alexandre Proutiere

TL;DR

Addressing exploration in reinforcement learning, the paper targets Best Policy Identification under minimal samples by deriving a tractable, model-free surrogate upper bound $U(\omega)$ for the instance-specific lower bound $T_{\varepsilon}(\omega)$ using value-function moments $M_{sa}^{k}[V^\star]$; it then instantiates MF-BPI and its deep variant DBMF-BPI with bootstrapped ensembles to handle parametric uncertainty. The approach avoids explicit model estimation, yet leverages a principled bound to guide exploration in both tabular and continuous MDPs. Empirical results show faster learning of near-optimal policies than state-of-the-art baselines on hard-exploration tasks like RiverSwim, Forked RiverSwim, DeepSea, and CartPole swingup. This work offers a practical, scalable framework for model-free exploration that integrates information-theoretic insights with ensemble uncertainty quantification to achieve sample-efficient policy identification.

Abstract

We study the problem of exploration in Reinforcement Learning and present a novel model-free solution. We adopt an information-theoretical viewpoint and start from the instance-specific lower bound on the number of samples that have to be collected to identify a nearly-optimal policy. Deriving this lower bound along with the optimal exploration strategy entails solving an intricate optimization problem and requires a model of the system. In turn, most existing sample-optimal exploration algorithms rely on estimating the model. We derive an approximation of the instance-specific lower bound that only involves quantities that can be inferred using model-free approaches. Leveraging this approximation, we devise an ensemble-based model-free exploration strategy applicable to both tabular and continuous Markov decision processes. Numerical results demonstrate that our strategy is able to identify efficient policies faster than state-of-the-art exploration approaches.

Paper Structure (50 sections, 15 theorems, 70 equations, 22 figures, 6 tables, 5 algorithms)

Introduction
Related Work
Preliminaries
Towards Efficient Exploration Allocations
Upper bounds on $T_{\varepsilon}(\omega)$
Example on Tabular MDPs
Model-Free Active Exploration Algorithms
Exploration in tabular MDPs.
Extension to Deep Reinforcement Learning
Numerical Results
Conclusions
Numerical Results
The Forked Riverswim Environment
Details of Example \ref{example:randomly_drawn_mdp_value}
Riverswim and Forked Riverswim - Description and Additional Results
...and 35 more sections

Key Result

Theorem 4.1

Consider a communicating MDP $\phi$ with a unique optimal policy $\pi^\star$. For all vectors $\omega\in \Delta(S\times A)$, with

Figures (22)

Figure 1: Comparison of the upper bounds (\ref{eq:original_upper_bound}) and (\ref{eq:new_upper_bound}) for different sizes of $S$ and $\gamma=0.95$. We evaluated different allocations using $U_0(\omega)$ and $U(\omega)$. The allocations are: $\omega_0^\star$ (the optimal allocation in (\ref{eq:original_upper_bound})), $\omega^\star$ (the optimal allocation in (\ref{eq:new_upper_bound})) and $\omega_1^\star$ (the optimal allocation in (\ref{eq:new_bound_var_kbar_1}), obtained by setting $k=1$ uniformly across states and actions). For the random MDP we show the median value across $30$ runs.
Figure 2: Forced exploration example with $5$ states. We explore according to $\omega^{(t)}(s_t,a) = (1-\epsilon_t) \frac{\tilde{\omega}_t^\star(s_t,a)}{\sum_{a'}\tilde{\omega}_t^\star(s_t,a')} + \epsilon_t \frac{1}{|A|}$, mixing the estimate of the allocation $\tilde{\omega}^\star$ from \ref{corollary:upper_bound_new_bound} with a uniform policy, with $\epsilon_t = \max(10^{-3}, 1/N_t(s_t))$ where $N_t(s)$ indicates the number of times the agent visited state $s$ up to time $t$. The shaded area indicates the $95\%$ confidence interval.
Figure 3: Evaluation of the estimated optimal policy $\pi_T^\star$ after $T$ steps for MF-BPI, Q-UCB, MDP-NaS and PSRL. Results are averaged across 10 seeds and lines indicate $95\%$ confidence intervals.
Figure 4: Cartpole swingup problem. On the left: total upright time at a difficulty level of $k=10$. On the right: total upright time after $200$ episodes for different difficulties $k$. To observe a positive reward, the pole's angle must satisfy $\cos(\theta) > k/20$, and the cart's position should satisfy $|x|\leq 1-k/20$. Bars and shaded areas indicate $95\%$ confidence intervals.
Figure 5: Exploration in Cartpole swingup for $k=5$. On the left, we show the entropy of visitation frequency for the state space $(x, \dot{x}, \theta, \dot{\theta})$ during training. On the right, we show a measure of the dispersion of the most recent visits; smaller values indicate that the agent is less explorative as $t$ increases.
...and 17 more figures
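
The mixing rule in the Figure 2 caption can be illustrated with a short sketch. This is not the authors' implementation; it only evaluates the stated formula for a single state. The function name `mixed_exploration_policy` and its arguments are hypothetical, and two fallbacks are assumptions not given in the caption: $\epsilon_t = 1$ when the state has not been visited yet, and a uniform distribution when the estimated allocation sums to zero.

```python
def mixed_exploration_policy(omega_row, visits):
    """Action probabilities omega^(t)(s_t, .) for the current state s_t.

    omega_row: estimated allocation values over actions for s_t (nonnegative).
    visits:    N_t(s_t), the number of visits to s_t so far.
    """
    n_actions = len(omega_row)
    if n_actions == 0:
        raise ValueError("omega_row must contain at least one action")

    # epsilon_t = max(1e-3, 1 / N_t(s_t)); assumption: epsilon_t = 1 if unvisited.
    eps = max(1e-3, 1.0 / visits) if visits > 0 else 1.0

    # Normalize the allocation estimate over actions in the current state.
    total = sum(omega_row)
    if total > 0:
        normalized = [w / total for w in omega_row]
    else:
        # Assumption: fall back to uniform if the estimate carries no mass.
        normalized = [1.0 / n_actions] * n_actions

    # Mix the normalized allocation with the uniform policy.
    return [(1.0 - eps) * p + eps / n_actions for p in normalized]


if __name__ == "__main__":
    probs = mixed_exploration_policy([0.3, 0.1], visits=4)
    print(probs)
```

Because $\epsilon_t$ never drops below $10^{-3}$, every action keeps a probability of at least $10^{-3}/|A|$ in every state, which is what makes the exploration forced.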

Theorems & Definitions (25)

Theorem 4.1: al2021adaptive
Theorem 4.2
Proposition 5.1
Lemma B.1: Forced exploration
proof
Lemma C.1
proof: Proof of \ref{lemma:bathia-davis-ineq}
Theorem C.2: $(\delta,\varepsilon)$-PAC lower bound
Proposition C.3
proof
...and 15 more
