Model-Free Active Exploration in Reinforcement Learning
Alessio Russo, Alexandre Proutiere
TL;DR
The paper addresses exploration in reinforcement learning through Best Policy Identification with as few samples as possible. It derives a tractable, model-free surrogate upper bound $U(\omega)$ for the instance-specific lower bound $T_\epsilon(\omega)$ using value-function moments $M_{sa}^{k}[V^\star]$, then instantiates MF-BPI and its deep variant DBMF-BPI, which use bootstrapped ensembles to handle parametric uncertainty. The approach avoids explicit model estimation, yet uses a principled bound to guide exploration in both tabular and continuous MDPs. Empirical results show faster learning of near-optimal policies than state-of-the-art baselines on hard-exploration tasks such as RiverSwim, Forked RiverSwim, DeepSea, and CartPole swingup. The result is a practical, scalable framework for model-free exploration that combines information-theoretic insights with ensemble-based uncertainty quantification to achieve sample-efficient policy identification.
Abstract
We study the problem of exploration in Reinforcement Learning and present a novel model-free solution. We adopt an information-theoretical viewpoint and start from the instance-specific lower bound on the number of samples that have to be collected to identify a nearly-optimal policy. Deriving this lower bound along with the optimal exploration strategy entails solving an intricate optimization problem and requires a model of the system. In turn, most existing sample-optimal exploration algorithms rely on estimating the model. We derive an approximation of the instance-specific lower bound that only involves quantities that can be inferred using model-free approaches. Leveraging this approximation, we devise an ensemble-based model-free exploration strategy applicable to both tabular and continuous Markov decision processes. Numerical results demonstrate that our strategy is able to identify efficient policies faster than state-of-the-art exploration approaches.
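To make the idea of an ensemble-based, model-free learner more concrete, the following is a minimal sketch of a generic bootstrapped ensemble of tabular Q-tables. It is not the paper's MF-BPI algorithm: the function names (`make_ensemble`, `bootstrap_update`, `ensemble_spread`), the Bernoulli bootstrap mask, and the hyperparameters `lr`, `gamma`, and `mask_prob` are all assumptions chosen for illustration. It only shows how disagreement across ensemble members can serve as a rough proxy for parametric uncertainty without estimating a transition model.

```python
import numpy as np

def make_ensemble(n_members, n_states, n_actions):
    # One Q-table per ensemble member, all initialised to zero (assumed init).
    return np.zeros((n_members, n_states, n_actions))

def bootstrap_update(q, s, a, r, s_next, rng, lr=0.1, gamma=0.99, mask_prob=0.5):
    # Each member sees the transition with probability mask_prob,
    # so members are trained on different bootstrap subsets of the data.
    mask = rng.random(q.shape[0]) < mask_prob
    # Per-member Q-learning target, computed from the sampled transition only.
    target = r + gamma * q[:, s_next].max(axis=1)
    q[mask, s, a] += lr * (target[mask] - q[mask, s, a])
    return mask

def ensemble_spread(q):
    # Standard deviation across members for every (state, action) pair.
    return q.std(axis=0)

rng = np.random.default_rng(0)
q = make_ensemble(n_members=5, n_states=3, n_actions=2)
mask = bootstrap_update(q, s=0, a=1, r=1.0, s_next=2, rng=rng)
spread = ensemble_spread(q)
```

In this sketch, state-action pairs with a large `spread` are those on which the members disagree; an exploration rule could favour them, though how the paper combines such uncertainty with its bound is not specified in this section.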
