Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization
Daniil Tiapkin, Evgenii Chzhen, Gilles Stoltz
TL;DR
This work tackles online learning in episodic adversarial MDPs with full-information feedback, where rewards are revealed only after each episode. It proposes APO-MVP, a policy-optimization-based method that relies on dynamic programming and black-box online linear optimization over estimated advantages, avoiding occupancy-measure machinery. The authors prove a high-probability regret bound of $R_T = \tilde{\mathcal{O}}(\mathrm{poly}(H)\sqrt{SAT})$ (up to additive $\mathcal{O}(H^3SA)$ terms), improving the state-dependence from prior occupancy-measure approaches by removing a $\sqrt{S}$ factor and matching minimax lower bounds in $S,A,T$ (up to logarithmic factors). This bridges the gap between adversarial and stochastic MDPs and yields a practical, implementable algorithm with strong theoretical guarantees. Limitations include a rather steep $H$-dependence ($\tilde{\mathcal{O}}(\sqrt{H^7})$) and the assumption of full-information feedback; extending the approach to bandit feedback remains an open direction with potential for further improvement.
Abstract
We consider the problem of learning in adversarial Markov decision processes (MDPs) with an oblivious adversary in a full-information setting. The agent interacts with an environment during $T$ episodes, each of which consists of $H$ stages, and each episode is evaluated with respect to a reward function that is revealed only at the end of the episode. We propose an algorithm, called APO-MVP, that achieves a regret bound of order $\tilde{\mathcal{O}}(\mathrm{poly}(H)\sqrt{SAT})$, where $S$ and $A$ are the sizes of the state and action spaces, respectively. This result improves upon the best-known regret bound by a factor of $\sqrt{S}$, bridging the gap between adversarial and stochastic MDPs, and matching the minimax lower bound $\Omega(\sqrt{H^3SAT})$ as far as the dependencies in $S,A,T$ are concerned. The proposed algorithm and analysis completely avoid the typical tool given by occupancy measures; instead, the algorithm performs policy optimization based only on dynamic programming and on a black-box online linear optimization strategy run over estimated advantage functions, making it easy to implement. The analysis leverages two recent techniques: policy optimization based on online linear optimization strategies (Jonckheere et al., 2023) and a refined martingale analysis of the impact on values of estimating transition kernels (Zhang et al., 2023).
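To make the idea of policy optimization through dynamic programming and an online linear optimization strategy run over advantage functions more concrete, here is a minimal Python sketch. It is not APO-MVP itself. It assumes the transition kernel is known, so the advantages are exact rather than estimated and no exploration bonuses appear, and it uses exponential weights as one possible online linear optimization strategy. All names (`evaluate_policy`, `run_policy_optimization`, `eta`) and the array layout are hypothetical choices made for illustration.

```python
import math


def evaluate_policy(P, r, pi):
    """Backward dynamic programming for the Q- and V-functions of policy pi.

    Assumed layout (hypothetical): P[h][s][a] is a list of next-state
    probabilities, r[h][s][a] is a reward, pi[h][s] is a distribution
    over actions.
    """
    H, S, A = len(r), len(r[0]), len(r[0][0])
    V = [[0.0] * S for _ in range(H + 1)]  # V[H] is the terminal value 0
    Q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            for a in range(A):
                nxt = sum(P[h][s][a][s2] * V[h + 1][s2] for s2 in range(S))
                Q[h][s][a] = r[h][s][a] + nxt
            V[h][s] = sum(pi[h][s][a] * Q[h][s][a] for a in range(A))
    return Q, V


def softmax(scores, eta):
    """Exponential weights over cumulative scores (numerically stabilized)."""
    m = max(scores)
    w = [math.exp(eta * (x - m)) for x in scores]
    z = sum(w)
    return [x / z for x in w]


def run_policy_optimization(P, rewards, eta):
    """One exponential-weights learner per (stage, state), fed with advantages.

    `rewards` is the sequence of reward functions, each revealed only after
    the corresponding episode (full-information feedback).
    Assumption: the true kernel P is known, unlike in the paper's setting.
    """
    H, S, A = len(P), len(P[0]), len(P[0][0])
    cum_adv = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    values = []
    for r in rewards:
        # Policy played in this episode, chosen before r is revealed.
        pi = [[softmax(cum_adv[h][s], eta) for s in range(S)] for h in range(H)]
        Q, V = evaluate_policy(P, r, pi)
        values.append(V[0][0])  # value of the played policy from state 0
        for h in range(H):
            for s in range(S):
                for a in range(A):
                    cum_adv[h][s][a] += Q[h][s][a] - V[h][s]
    final_pi = [[softmax(cum_adv[h][s], eta) for s in range(S)] for h in range(H)]
    return values, final_pi
```

The design choice illustrated here is that each (stage, state) pair runs its own learner over the action simplex, and the only global computation is the backward dynamic programming pass; no occupancy measure is ever formed.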
