Asymptotically optimal regret in communicating Markov decision processes

Victor Boone

TL;DR

The paper addresses regret minimization for communicating Markov decision processes in the average-reward setting. It proves a tight first-order lower bound $\mathrm{Reg}(T; M, \Lambda) \ge K(M) \log(T)$ and constructs an algorithm that achieves $\mathrm{Reg}(T; M, \Lambda) = K(M) \log(T) + \mathrm{o}(\log(T))$. It introduces the ECoE framework (Exploration-Coexploration-Exploitation), which learns the regret constant $K(M)$ and balances exploration, co-exploration, and exploitation, and it handles the discontinuities of $K(M)$ through a regularized, leveled formulation $K_\epsilon(M)$ and the notion of near-optimal pairs. The main contribution is an asymptotically optimal algorithm, ECoE*, that operates under Bernoulli rewards and a product-form ambient space, with formal regret guarantees and convergence properties for the regularized lower bound and its optimizers. The work also outlines future directions: tractable implementations, extensions to broader reward models, and possible relaxations of the communicating assumption.

Abstract

In this paper, we present a learning algorithm that achieves asymptotically optimal regret for Markov decision processes in average reward under a communicating assumption. That is, given a communicating Markov decision process $M$, our algorithm has regret $K(M) \log(T) + \mathrm{o}(\log(T))$, where $T$ is the number of learning steps and $K(M)$ is the best possible constant. The algorithm works by explicitly tracking the constant $K(M)$ to learn optimally, then balancing the trade-off between exploration (playing sub-optimally to gain information), co-exploration (playing optimally to gain information) and exploitation (playing optimally to score maximally). We further show that the function $K(M)$ is discontinuous, which is a substantial challenge for our approach. To address it, we describe a regularization mechanism to estimate $K(M)$ with arbitrary precision from empirical data.
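
The regret guarantee stated in the abstract can be read as a statement about limits. The sketch below assumes the standard average-reward definition of regret; the symbols $g^*(M)$ (optimal gain of $M$), $R_t$ (reward collected at step $t$) and $\Lambda^*$ (the proposed algorithm) are notation introduced here for illustration and may differ from the paper's own.

```latex
% Assumed definition (standard in the average-reward setting; not quoted from the paper):
\mathrm{Reg}(T; M, \Lambda) = T \, g^*(M) - \mathbb{E}^{M, \Lambda}\!\left[ \sum_{t=1}^{T} R_t \right]

% Upper bound from the abstract, divided through by \log(T):
\mathrm{Reg}(T; M, \Lambda^*) = K(M) \log(T) + \mathrm{o}(\log(T))
\quad \Longrightarrow \quad
\lim_{T \to \infty} \frac{\mathrm{Reg}(T; M, \Lambda^*)}{\log(T)} = K(M)
```

The implication holds because $\mathrm{o}(\log(T)) / \log(T) \to 0$ by definition of little-o, so only the constant $K(M)$ survives in the limit. This is the sense in which $K(M)$ is the first-order regret constant.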
