Model-Free Active Exploration in Reinforcement Learning

Alessio Russo, Alexandre Proutiere

TL;DR

Addressing exploration in reinforcement learning, the paper targets Best Policy Identification under minimal samples by deriving a tractable, model-free surrogate upper bound $U(\omega)$ for the instance-specific lower bound $T_\epsilon(\omega)$ using value-function moments $M_{sa}^{k}[V^\star]$; it then instantiates MF-BPI and its deep variant DBMF-BPI with bootstrapped ensembles to handle parametric uncertainty. The approach avoids explicit model estimation, yet leverages a principled bound to guide exploration in both tabular and continuous MDPs. Empirical results show faster learning of near-optimal policies than state-of-the-art baselines on hard-exploration tasks like RiverSwim, Forked RiverSwim, DeepSea, and CartPole swingup. This work offers a practical, scalable framework for model-free exploration that integrates information-theoretic insights with ensemble uncertainty quantification to achieve sample-efficient policy identification.

Abstract

We study the problem of exploration in Reinforcement Learning and present a novel model-free solution. We adopt an information-theoretical viewpoint and start from the instance-specific lower bound of the number of samples that have to be collected to identify a nearly-optimal policy. Deriving this lower bound along with the optimal exploration strategy entails solving an intricate optimization problem and requires a model of the system. In turn, most existing sample optimal exploration algorithms rely on estimating the model. We derive an approximation of the instance-specific lower bound that only involves quantities that can be inferred using model-free approaches. Leveraging this approximation, we devise an ensemble-based model-free exploration strategy applicable to both tabular and continuous Markov decision processes. Numerical results demonstrate that our strategy is able to identify efficient policies faster than state-of-the-art exploration approaches.

Paper Structure

This paper contains 50 sections, 15 theorems, 70 equations, 22 figures, 6 tables, 5 algorithms.

Key Result

Theorem 4.1

Consider a communicating MDP $\phi$ with a unique optimal policy $\pi^\star$. For all vectors $\omega\in \Delta(S\times A)$, with

Figures (22)

  • Figure 1: Comparison of the upper bounds (\ref{eq:original_upper_bound}) and (\ref{eq:new_upper_bound}) for different sizes of $S$ and $\gamma=0.95$. We evaluated different allocations using $U_0(\omega)$ and $U(\omega)$. The allocations are: $\omega_0^\star$ (the optimal allocation in (\ref{eq:original_upper_bound})), $\omega^\star$ (the optimal allocation in (\ref{eq:new_upper_bound})), and $\omega_1^\star$ (the optimal allocation in (\ref{eq:new_bound_var_kbar_1}), obtained by setting $k=1$ uniformly across states and actions). For the random MDP we show the median value across $30$ runs.
  • Figure 2: Forced exploration example with $5$ states. We explore according to $\omega^{(t)}(s_t,a) = (1-\epsilon_t) \frac{\tilde{\omega}_t^\star(s_t,a)}{\sum_{a'}\tilde{\omega}_t^\star(s_t,a')} + \epsilon_t \frac{1}{|A|}$, mixing the estimate of the allocation $\tilde{\omega}^\star$ from \ref{corollary:upper_bound_new_bound} with a uniform policy, with $\epsilon_t = \max(10^{-3}, 1/N_t(s_t))$, where $N_t(s)$ indicates the number of times the agent visited state $s$ up to time $t$. The shaded area indicates the $95\%$ confidence interval.
  • Figure 3: Evaluation of the estimated optimal policy $\pi_T^\star$ after $T$ steps for MF-BPI, Q-UCB, MDP-NaS and PSRL. Results are averaged across 10 seeds and lines indicate $95\%$ confidence intervals.
  • Figure 4: Cartpole swingup problem. On the left: total upright time at a difficulty level of $k=10$. On the right: total upright time after $200$ episodes for different difficulties $k$. To observe a positive reward, the pole's angle must satisfy $\cos(\theta) > k/20$, and the cart's position should satisfy $|x|\leq 1-k/20$. Bars and shaded areas indicate $95\%$ confidence intervals.
  • Figure 5: Exploration in Cartpole swingup for $k=5$. On the left, we show the entropy of visitation frequency for the state space $(x, \dot{x}, \theta, \dot{\theta})$ during training. On the right, we show a measure of the dispersion of the most recent visits; smaller values indicate that the agent is less explorative as $t$ increases.
  • ...and 17 more figures
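
The forced-exploration rule in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `exploration_probs` and its parameters are hypothetical, and the handling of an unvisited state ($N_t(s_t)=0$) and of an all-zero allocation estimate are assumptions that the caption does not specify.

```python
import numpy as np

def exploration_probs(omega_tilde_row, n_visits, eps_min=1e-3):
    """Action probabilities in the current state s_t (hypothetical helper).

    omega_tilde_row: estimated allocation over actions for s_t (non-negative).
    n_visits:        N_t(s_t), number of visits to s_t so far.
    """
    omega_tilde_row = np.asarray(omega_tilde_row, dtype=float)
    num_actions = omega_tilde_row.size

    # epsilon_t = max(1e-3, 1 / N_t(s_t)).
    # Assumption: an unvisited state gets epsilon_t = 1 (fully uniform).
    eps = 1.0 if n_visits == 0 else max(eps_min, 1.0 / n_visits)

    # Normalize the allocation over actions in s_t.
    # Assumption: fall back to uniform if the estimate sums to zero.
    total = omega_tilde_row.sum()
    if total > 0:
        normalized = omega_tilde_row / total
    else:
        normalized = np.full(num_actions, 1.0 / num_actions)

    # Mix the normalized allocation with the uniform policy.
    return (1.0 - eps) * normalized + eps / num_actions


# Usage: two actions, state visited 4 times, so epsilon_t = 0.25.
probs = exploration_probs([0.1, 0.3], n_visits=4)
rng = np.random.default_rng(0)
action = rng.choice(len(probs), p=probs)
```

Because $\epsilon_t$ never drops below $10^{-3}$, every action keeps a strictly positive probability in every state, which is what makes the exploration "forced".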

Theorems & Definitions (25)

  • Theorem 4.1: al2021adaptive
  • Theorem 4.2
  • Proposition 5.1
  • Lemma B.1: Forced exploration
  • proof
  • Lemma C.1
  • proof: Proof of \ref{lemma:bathia-davis-ineq}
  • Theorem C.2: $(\delta,\varepsilon)$-PAC lower bound
  • Proposition C.3
  • proof
  • ...and 15 more