Finding good policies in average-reward Markov Decision Processes without prior knowledge
Adrienne Tuynman, Rémy Degenne, Emilie Kaufmann
TL;DR
The paper tackles the problem of identifying an $\varepsilon$-optimal policy in average-reward MDPs without prior knowledge of key MDP parameters. It shows that the sample complexity of estimating the optimal bias span $H$ cannot be bounded by any function of $S$, $A$ and $H$ and, in the online setting, that no algorithm can achieve a sample complexity polynomial in $H$; these results motivate diameter-based and data-driven approaches. The authors introduce Diameter Free Exploration (DFE), which uses a diameter estimator to obtain an $(\varepsilon,\delta)$-PAC algorithm in the generative model setting with sample complexity $\widetilde{O}\left(\frac{SAD}{\varepsilon^2}\log\frac{1}{\delta}+D^2SA\right)$, which is near-optimal for small $\varepsilon$. They also develop online variants with $\widetilde{O}\left(\frac{SAD^2}{\varepsilon^2}\right)$-type guarantees and propose a novel stopping rule based on value iteration (VI) that is PAC for any sampling strategy, pointing toward further adaptive improvements. Overall, the work advances best policy identification (BPI) in average-reward MDPs without prior knowledge: it establishes fundamental limits and provides algorithms with guarantees in both the generative-model and online settings.
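As a rough reading of the DFE bound above, one can ask when the $\varepsilon$-dependent term dominates the additive $D^2SA$ term. The comparison below ignores the logarithmic factors hidden in $\widetilde{O}$, so it should be taken as an order-of-magnitude guide rather than an exact threshold:

$$
\frac{SAD}{\varepsilon^2}\log\frac{1}{\delta} \;\geq\; D^2SA
\quad\Longleftrightarrow\quad
\varepsilon^2 \;\leq\; \frac{1}{D}\log\frac{1}{\delta}
\quad\Longleftrightarrow\quad
\varepsilon \;\leq\; \sqrt{\frac{\log(1/\delta)}{D}}.
$$

Under this simplification, the bound behaves like $SAD/\varepsilon^2$ once $\varepsilon$ is below roughly $\sqrt{\log(1/\delta)/D}$.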
Abstract
We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs). In such MDPs, two measures of complexity have appeared in the literature: the diameter, $D$, and the optimal bias span, $H$, which satisfy $H\leq D$. Prior work has studied the complexity of $\varepsilon$-optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with $D \simeq H$ for which the sample complexity to output an $\varepsilon$-optimal policy is $\Omega(SAD/\varepsilon^2)$, where $S$ and $A$ are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order $SAH/\varepsilon^2$ has been proposed, but it requires knowledge of $H$. We first show that the sample complexity required to estimate $H$ is not bounded by any function of $S$, $A$ and $H$, which rules out easily making the previous algorithm agnostic to $H$. By relying instead on a diameter estimation procedure, we propose the first algorithm for $(\varepsilon,\delta)$-PAC policy identification that does not need any form of prior knowledge of the MDP. Its sample complexity scales as $SAD/\varepsilon^2$ in the regime of small $\varepsilon$, which is near-optimal. In the online setting, our first contribution is a lower bound which implies that a sample complexity polynomial in $H$ cannot be achieved. Then, we propose an online algorithm with a sample complexity of order $SAD^2/\varepsilon^2$, as well as a novel approach based on a data-dependent stopping rule that we believe is promising to further reduce this bound.
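To make the two complexity measures concrete, here is a minimal sketch that computes $D$ and $H$ on a toy MDP whose transition kernel is known. It is not the paper's DFE procedure, which has to estimate the diameter from samples. It relies on the standard definitions: $D$ is the largest, over ordered pairs of distinct states, of the minimal expected hitting time, and $H$ is the maximum minus the minimum of the optimal bias function. The function names, the `P`/`R` layout and the two-state example are hypothetical, chosen for illustration. The sketch assumes rewards in $[0,1]$, a communicating MDP, and positive self-loop probabilities so that relative value iteration converges.

```python
def diameter(P, n_iter=2000):
    """Max over targets of the minimal expected hitting time (known model)."""
    S = len(P)
    worst = 0.0
    for target in range(S):
        T = [0.0] * S  # T[s]: minimal expected time to reach `target` from s
        for _ in range(n_iter):
            # Stochastic shortest-path value iteration; T[target] stays 0.
            T = [
                0.0 if s == target else
                1.0 + min(sum(p * T[s2] for s2, p in enumerate(row))
                          for row in P[s])
                for s in range(S)
            ]
        worst = max(worst, max(T))
    return worst


def optimal_bias_span(P, R, n_iter=2000):
    """Span of the optimal bias, via relative value iteration (known model)."""
    S = len(P)
    h = [0.0] * S
    for _ in range(n_iter):
        # Bellman optimality backup for every state.
        Th = [
            max(R[s][a] + sum(p * h[s2] for s2, p in enumerate(P[s][a]))
                for a in range(len(P[s])))
            for s in range(S)
        ]
        # Subtract the value at reference state 0 to keep iterates bounded.
        h = [x - Th[0] for x in Th]
    return max(h) - min(h)


# Hypothetical two-state MDP: action 0 stays put, action 1 moves to the
# other state with probability p. P[s][a] is a distribution over next states.
p = 0.1
P = [[[1.0, 0.0], [1.0 - p, p]],
     [[0.0, 1.0], [p, 1.0 - p]]]
R = [[0.5, 0.5],   # state 0 always yields reward 0.5
     [1.0, 1.0]]   # state 1 always yields reward 1.0

D = diameter(P)
H = optimal_bias_span(P, R)
print(f"D = {D:.3f}, H = {H:.3f}")
```

On this example, solving the fixed-point equations by hand gives $D = 1/p = 10$ and $H = 0.5/p = 5$, consistent with $H \leq D$. Both quantities are computed here from the true model; the difficulty addressed in the abstract is that, without the model, $H$ cannot be estimated with a sample complexity bounded in terms of $S$, $A$ and $H$.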
