Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization

Daniil Tiapkin; Evgenii Chzhen; Gilles Stoltz

TL;DR

This work tackles online learning in episodic adversarial MDPs with full-information feedback, where rewards are revealed only after each episode. It proposes APO-MVP, a policy-optimization-based method that relies on dynamic programming and black-box online linear optimization over estimated advantages, avoiding occupancy-measure machinery. The authors prove a high-probability regret bound of $R_T = \tilde{O}(\mathrm{poly}(H)\sqrt{SAT})$ (up to additive $O(H^3SA)$ terms), improving the state-dependence from prior occupancy-measure approaches by removing a $\sqrt{S}$ factor and matching minimax lower bounds in $S,A,T$ (up to logarithmic factors). This bridges the gap between adversarial and stochastic MDPs and yields a practical, implementable algorithm with strong theoretical guarantees. Limitations include a rather steep $H$-dependence ($\tilde{O}(\sqrt{H^7})$) and the assumption of full-information feedback; extending the approach to bandit feedback remains an open direction with potential for further improvement.
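As a quick worked comparison of the state-dependence, ignoring the dependence on $H$ and logarithmic factors: a bound that is larger than $\sqrt{SAT}$ by a factor of $\sqrt{S}$ scales as $S\sqrt{AT}$. The value $S = 100$ below is a hypothetical number chosen only for illustration.

```latex
\[
  \frac{S\sqrt{AT}}{\sqrt{SAT}} \;=\; \frac{S}{\sqrt{S}} \;=\; \sqrt{S},
  \qquad \text{e.g. } S = 100 \;\Longrightarrow\; \sqrt{S} = 10 .
\]
```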

Abstract

We consider the problem of learning in adversarial Markov decision processes (MDPs) with an oblivious adversary in a full-information setting. The agent interacts with an environment during $T$ episodes, each of which consists of $H$ stages, and each episode is evaluated with respect to a reward function that is revealed only at the end of the episode. We propose an algorithm, called APO-MVP, that achieves a regret bound of order $\tilde{\mathcal{O}}(\mathrm{poly}(H)\sqrt{SAT})$, where $S$ and $A$ are the sizes of the state and action spaces, respectively. This result improves upon the best-known regret bound by a factor of $\sqrt{S}$, bridging the gap between adversarial and stochastic MDPs, and matching the minimax lower bound $\Omega(\sqrt{H^3SAT})$ as far as the dependencies in $S,A,T$ are concerned. The proposed algorithm and analysis completely avoid the typical tool given by occupancy measures; instead, the algorithm performs policy optimization based only on dynamic programming and on a black-box online linear optimization strategy run over estimated advantage functions, making it easy to implement. The analysis leverages two recent techniques: policy optimization based on online linear optimization strategies (Jonckheere et al., 2023) and a refined martingale analysis of the impact on values of estimating transition kernels (Zhang et al., 2023).
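The dynamic-programming ingredient mentioned above can be illustrated with a minimal sketch of backward induction on a tabular finite-horizon MDP, followed by the computation of advantage functions. This is not APO-MVP itself: it assumes the transition kernel is known (the paper works with estimated kernels), and the names `evaluate_policy`, `advantages`, and the nested-list layout `P[h][s][a][s2]` are hypothetical choices made for this illustration.

```python
def evaluate_policy(P, r, pi):
    """Backward induction for a fixed policy on a tabular episodic MDP.

    P[h][s][a][s2]: transition probability (assumed known in this sketch).
    r[h][s][a]:     reward at stage h.
    pi[h][s][a]:    probability that the policy plays a in state s at stage h.
    Returns (Q, V) with Q[h][s][a] and V[h][s]; V[H][s] = 0 by convention.
    """
    H, S, A = len(r), len(r[0]), len(r[0][0])
    V = [[0.0] * S for _ in range(H + 1)]
    Q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            for a in range(A):
                # Bellman equation: immediate reward plus expected next value.
                expected_next = sum(P[h][s][a][s2] * V[h + 1][s2] for s2 in range(S))
                Q[h][s][a] = r[h][s][a] + expected_next
            V[h][s] = sum(pi[h][s][a] * Q[h][s][a] for a in range(A))
    return Q, V


def advantages(Q, V):
    """Advantage of each action: Q[h][s][a] - V[h][s]."""
    return [[[q - V[h][s] for q in Q[h][s]] for s in range(len(Q[h]))]
            for h in range(len(Q))]


# Toy instance (hypothetical): H = 2 stages, S = 1 state, A = 2 actions.
P = [[[[1.0], [1.0]]] for _ in range(2)]
r = [[[1.0, 0.0]] for _ in range(2)]
pi = [[[0.5, 0.5]] for _ in range(2)]
Q, V = evaluate_policy(P, r, pi)
adv = advantages(Q, V)
```

In this toy instance, action 0 has a positive advantage at every stage, which is the kind of signal a policy-optimization step would use to shift probability mass toward it.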

Paper Structure (27 sections, 15 theorems, 86 equations, 2 algorithms)

Introduction
Related work
Problem Formulation
Additional notation.
Algorithm and Main Result
Algorithm APO-MVP
Within-epoch statement.
Main Result
Comparison to Rosenberg & Mansour (2019b).
Comparison to Cai et al. (2020).
Proof Sketch for Theorem 4
Term $\mathbf{(B)}$: OLO Analysis
Additional Technical Concepts
Term $\mathbf{(A)}$: Optimism
Term $\mathbf{(D)}$: Bonus Summation
...and 12 more sections

Key Result

Theorem 4

Algorithm APO-MVP, used, for instance, with the OLO strategies based on polynomial or exponential potential (see Examples 7 and 8), satisfies, with probability at least $1-3\delta$, a regret bound of order $\tilde{\mathcal{O}}(\mathrm{poly}(H)\sqrt{SAT})$, up to additive $O(H^3SA)$ terms.
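To make "OLO strategy based on exponential potential" concrete, here is a minimal sketch of exponential weights (Hedge) over a finite action set. It is an illustration under stated assumptions, not the paper's exact strategy: the learning rate `eta` is fixed here, whereas adaptive variants such as AdaHedge tune it from the data, and the class name `Hedge` and its methods are hypothetical. In a policy-optimization scheme, one such learner could be run per (stage, state) pair and fed estimated advantage vectors; that wiring is also an assumption of this sketch.

```python
import math


class Hedge:
    """Exponential-weights learner over n_actions actions with a fixed rate."""

    def __init__(self, n_actions, eta):
        self.eta = eta                      # fixed learning rate (assumption)
        self.cum_gains = [0.0] * n_actions  # cumulative gain of each action

    def distribution(self):
        """Probabilities proportional to exp(eta * cumulative gain)."""
        top = max(self.cum_gains)  # subtract the max for numerical stability
        weights = [math.exp(self.eta * (g - top)) for g in self.cum_gains]
        total = sum(weights)
        return [w / total for w in weights]

    def update(self, gains):
        """Full-information update: the whole gain vector is observed."""
        for a, g in enumerate(gains):
            self.cum_gains[a] += g


# Usage: feed gain vectors (e.g. estimated advantages) that favor action 0.
learner = Hedge(n_actions=3, eta=0.5)
p_start = learner.distribution()
for _ in range(3):
    learner.update([1.0, 0.0, 0.0])
p_end = learner.distribution()
```

The learner starts from the uniform distribution and concentrates on the action with the largest cumulative gain; the full-information update matches the feedback model described in the abstract, where the reward function is revealed at the end of each episode.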

Theorems & Definitions (27)

Remark 1: Two technical remarks
Theorem 4: Main theorem
proof
Lemma 5
Definition 6
Example 7
Example 8
Lemma 8: JMS23, JMS25
Lemma 9
Lemma 10: Doob's optional skipping
...and 17 more
