Finding good policies in average-reward Markov Decision Processes without prior knowledge

Adrienne Tuynman; Rémy Degenne; Emilie Kaufmann

TL;DR

The paper tackles the problem of identifying an $\varepsilon$-optimal policy in average-reward MDPs without prior knowledge of key MDP parameters. It shows that estimating the optimal bias span $H$ is intractable in general and, in the online setting, that no algorithm can achieve a sample complexity polynomial in $H$; these results motivate diameter-based and data-driven approaches. The authors introduce Diameter Free Exploration (DFE), which leverages a diameter estimator to obtain a provably near-optimal PAC method in the generative model, with sample complexity $\widetilde{O}\left(\frac{SAD}{\varepsilon^2}\log\frac{1}{\delta}+D^2SA\right)$. They also develop online variants with $\widetilde{O}\left(\frac{SAD^2}{\varepsilon^2}\right)$-type guarantees and propose a novel VI-based stopping rule that is PAC for any sampling strategy, pointing toward further adaptive improvements. Overall, the work advances agnostic AR-MDP BPI and highlights fundamental limits while delivering practical, near-optimal algorithms for both offline (generative) and online settings.
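To see when the $\frac{SAD}{\varepsilon^2}\log\frac{1}{\delta}$ term dominates the $D^2SA$ term in the DFE bound, the following minimal sketch evaluates both terms numerically. It is an illustration only: the constants and logarithmic factors hidden by $\widetilde{O}$ are dropped (an assumption), so the outputs are orders of magnitude rather than actual sample counts, and the function names are hypothetical.

```python
import math


def dfe_bound_terms(S, A, D, eps, delta):
    """Return the two terms of the bound with constants and log factors dropped.

    main    ~ S*A*D / eps^2 * log(1/delta)
    burn_in ~ D^2 * S * A
    """
    main = S * A * D / eps**2 * math.log(1.0 / delta)
    burn_in = D**2 * S * A
    return main, burn_in


def eps_threshold(D, delta):
    """Largest eps for which main >= burn_in, i.e. eps^2 <= log(1/delta) / D."""
    return math.sqrt(math.log(1.0 / delta) / D)


# Hypothetical problem sizes, chosen only for illustration.
S, A, D, delta = 10, 5, 20, 0.05
main, burn_in = dfe_bound_terms(S, A, D, eps=0.01, delta=delta)
eps_star = eps_threshold(D, delta)
main_star, burn_star = dfe_bound_terms(S, A, D, eps=eps_star, delta=delta)
```

Under these simplifications, the first term dominates as soon as $\varepsilon^2 \leq \log(1/\delta)/D$, which is one way to read the "small $\varepsilon$" regime in which the bound scales as $SAD/\varepsilon^2$.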

Abstract

We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs). In such MDPs, two measures of complexity have appeared in the literature: the diameter, $D$, and the optimal bias span, $H$, which satisfy $H\leq D$. Prior work has studied the complexity of $\varepsilon$-optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with $D \simeq H$ for which the sample complexity to output an $\varepsilon$-optimal policy is $\Omega(SAD/\varepsilon^2)$, where $S$ and $A$ are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order $SAH/\varepsilon^2$ has been proposed, but it requires knowledge of $H$. We first show that the sample complexity required to estimate $H$ is not bounded by any function of $S$, $A$ and $H$, ruling out the possibility of easily making the previous algorithm agnostic to $H$. By relying instead on a diameter estimation procedure, we propose the first algorithm for $(\varepsilon,\delta)$-PAC policy identification that does not need any form of prior knowledge of the MDP. Its sample complexity scales as $SAD/\varepsilon^2$ in the regime of small $\varepsilon$, which is near-optimal. In the online setting, our first contribution is a lower bound which implies that a sample complexity polynomial in $H$ cannot be achieved in this setting. We then propose an online algorithm with a sample complexity of order $SAD^2/\varepsilon^2$, as well as a novel approach based on a data-dependent stopping rule that we believe is promising for further reducing this bound.
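The two complexity measures can be computed exactly when the MDP is known, which makes the relation $H \leq D$ concrete. The sketch below is not the paper's estimation procedure, which only has access to samples. It is a standard known-model computation on a hypothetical two-state, two-action MDP with rewards in $[0,1]$ and strictly positive transition probabilities (so the MDP is ergodic and both iterations converge). The optimal bias is obtained by relative value iteration, and the diameter as the largest minimal expected hitting time between two states.

```python
import numpy as np


def optimal_bias_span(P, r, tol=1e-10, max_iter=100_000):
    """Span of the optimal bias via relative value iteration (known model)."""
    h = np.zeros(P.shape[0])
    for _ in range(max_iter):
        q = r + P @ h               # Q-values, shape (S, A)
        h_new = q.max(axis=1)       # Bellman optimality backup
        h_new = h_new - h_new[0]    # renormalise so that h[0] = 0
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    return float(h.max() - h.min())


def diameter(P, tol=1e-10, max_iter=100_000):
    """Largest minimal expected hitting time over ordered pairs of states."""
    S = P.shape[0]
    D = 0.0
    for target in range(S):
        t = np.zeros(S)
        for _ in range(max_iter):
            t_new = 1.0 + (P @ t).min(axis=1)  # fastest policy towards target
            t_new[target] = 0.0                # already at the target
            if np.max(np.abs(t_new - t)) < tol:
                t = t_new
                break
            t = t_new
        D = max(D, float(t.max()))
    return D


# Hypothetical toy MDP: P[s, a, s'] are transition probabilities, r[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 0.0],
              [1.0, 1.0]])

H = optimal_bias_span(P, r)
D = diameter(P)
```

On this toy instance, solving the Bellman optimality equation by hand gives $H = 5/3$ and $D = 2$, consistent with $H \leq D$.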

Paper Structure (33 sections, 16 theorems, 86 equations, 7 figures, 1 table, 4 algorithms)

Introduction
Contributions
Related work
On the hardness of estimating $H$
Sketch of proof
A near-optimal algorithm without prior knowledge
On the hardness of best policy identification in the online setting
On the hardness of a regret to PAC conversion
A hardness result
Sketch of proof
Algorithms for the online setting
Diameter Free Exploration for the online setting
Towards more adaptive algorithms
The Value Iteration stopping rule
Choosing a sampling rule
...and 18 more sections

Key Result

Theorem 1

For any $\delta < \frac{1}{2e^4}$, $T>0$, $\Delta$, there exists an ergodic MDP $\mathcal{M}$ with optimal bias span $H=1/2$, $S=3$ and $A=2$ such that any algorithm that computes a $\Delta$-tight upper bound for the optimal bias span with probability $1-\delta$ under a generative model assumption n

Figures (7)

Figure 1: MDP $\mathcal{M}_R$, the hard instance for Theorem 1. Each arrow corresponds to a state-action and next state combination, and is annotated with the mean reward of the action and the probability of the transition. Arrows with a different line style correspond to different actions.
Figure 2: MDP $\mathcal{M}_{p,p'}$ with high mixing time.
Figure 3: MDP $\mathcal{M}_j$, the hard instance for Theorem \ref{th:online}.
Figure 4: $\widetilde{\mathcal{M}}_R$ in the ergodic case
Figure 2 (repeated): MDP $\mathcal{M}_{p,p'}$
...and 2 more figures

Theorems & Definitions (27)

Definition 1
Theorem 1
Theorem 2
Theorem 3
Theorem 4
Theorem 5
Definition 2
Theorem 6
Lemma 1
Lemma 2: Lemma 2 in TarbouriechPirotta21
...and 17 more
