Asymptotically optimal regret in communicating Markov decision processes

Victor Boone

TL;DR

The paper addresses regret minimization for communicating Markov decision processes in the average-reward setting, proving a tight first-order lower bound Reg(T; M, Λ) ≥ K(M) log(T) and constructing an algorithm that achieves Reg(T; M, Λ) = K(M) log(T) + o(log(T)). It introduces the ECoE framework (Exploration-Coexploration-Exploitation) to learn the hard regret constant K(M) and balance exploration, co-exploration, and exploitation, while accounting for the discontinuities of K(M) via a regularized, leveled formulation K_ε(M) and near-optimal-pair concepts. The main contribution is an asymptotically optimal algorithm, ECoE*, that operates under Bernoulli rewards and a product-form ambient space, with formal regret guarantees and convergence properties for the regularized lower bound and its optimizers. The work also delineates future directions, including tractable implementations, extensions to broader reward models, and potential relaxations of the communicating assumption, highlighting a path toward universally optimal learning in complex MDPs.

Abstract

In this paper, we present a learning algorithm that achieves asymptotically optimal regret for Markov decision processes in average reward under a communicating assumption. That is, given a communicating Markov decision process $M$, our algorithm has regret $K(M) \log(T) + \mathrm{o}(\log(T))$ where $T$ is the number of learning steps and $K(M)$ is the best possible constant. This algorithm works by explicitly tracking the constant $K(M)$ to learn optimally, then balances the trade-off between exploration (playing sub-optimally to gain information), co-exploration (playing optimally to gain information) and exploitation (playing optimally to score maximally). We further show that the function $K(M)$ is discontinuous, which is a consequent challenge for our approach. To that end, we describe a regularization mechanism to estimate $K(M)$ with arbitrary precision from empirical data.
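As a reading aid, the claim of asymptotic optimality can be restated as a pair of limits. This is only a first-order paraphrase of the statements above, using the notation $\mathrm{Reg}(T; M, \Lambda)$ from the TL;DR, and not the paper's exact formulation:

```latex
% Lower bound, read to first order, for every consistent algorithm \Lambda:
\liminf_{T \to \infty} \frac{\mathrm{Reg}(T; M, \Lambda)}{\log(T)} \ge K(M)

% Regret claimed for the paper's algorithm, and what it implies:
\mathrm{Reg}(T; M, \Lambda) = K(M) \log(T) + \mathrm{o}(\log(T))
\;\Longrightarrow\;
\lim_{T \to \infty} \frac{\mathrm{Reg}(T; M, \Lambda)}{\log(T)} = K(M)
```

Together, the two lines say that no consistent algorithm can beat the constant $K(M)$ in front of $\log(T)$, and that the proposed algorithm attains it.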

Paper Structure

This paper contains 100 sections, 82 theorems, 359 equations, 6 figures, 2 algorithms.

Key Result

Theorem 1

Fix $\mathcal{M}$, a space of Markov decision processes. For every learning algorithm $\Lambda$ that is consistent on $\mathcal{M}$, for every communicating $M \in \mathcal{M}$ (assumption_communicating) and every $s_0 \in \mathcal{S}$, we have $\mathrm{Reg}(T; M, \Lambda, s_0) \ge K(M) \log(T)$ [...] where $\mathrm{KL}(M(z)||M^\dagger(z)) := \mathrm{KL}(\nu(z)||\nu^\dagger(z)) + \mathrm{KL}(p(z)||p^\dagger(z))$ [...]
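The pair-wise divergence quoted in Theorem 1 can be illustrated numerically. The sketch below rests on several assumptions: it takes $\nu(z)$ to be the reward distribution and $p(z)$ the transition kernel at a state-action pair $z$, it takes rewards to be Bernoulli (as in the TL;DR), and it reads the truncated statement as a sum of a reward term and a transition term. The function names `kl_bernoulli`, `kl_categorical` and `kl_pair` are hypothetical and do not come from the paper.

```python
import math


def kl_bernoulli(a, b):
    """KL(Ber(a) || Ber(b)), with the convention 0 * log(0 / y) = 0."""
    def term(x, y):
        if x == 0.0:
            return 0.0
        if y == 0.0:
            return math.inf  # mass under the first law, none under the second
        return x * math.log(x / y)

    return term(a, b) + term(1.0 - a, 1.0 - b)


def kl_categorical(p, q):
    """KL(p || q) for two distributions over the same finite set of next states."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def kl_pair(reward_mean, transition, reward_mean_alt, transition_alt):
    """Assumed form of KL(M(z) || M'(z)): reward term plus transition term."""
    return (kl_bernoulli(reward_mean, reward_mean_alt)
            + kl_categorical(transition, transition_alt))


# Same deterministic transition, different Bernoulli reward means:
# only the reward term contributes.
print(kl_pair(0.5, [1.0, 0.0], 0.25, [1.0, 0.0]))
```

An alternative model that moves a deterministic transition to a different next state yields an infinite divergence under this convention, so in such a case only the reward means can be perturbed at finite cost.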

Figures (6)

  • Figure 1: A class of Markov decision processes with deterministic transitions parameterized by $\theta$ where co-exploration is troublesome. Arrows are choices of actions that deterministically lead to the pointed state and labels are rewards.
  • Figure 2: An example of discontinuity of the regret lower bound. The displayed transitions are deterministic and the labels represent the means of the attached Bernoulli rewards. Actions are distinguished by unique symbols for better readability of the contraction. Optimal pairs are colored in red.
  • Figure 3: An example of leveling transform. On the left, a Markov decision process $M$ with two states and two actions per state. Transitions are deterministic and represented by arrows. Labels are rewards. The model $M'$ is a noisy version of $M$.
  • Figure 4: A discontinuity of $\Delta^{*}(M)$. A class of Bernoulli reward models with deterministic transitions parameterized by $\theta \in \Theta \equiv [-\frac{1}{2}, \frac{1}{2}]$. Arrows are choices of actions that deterministically lead to the pointed state and labels are mean rewards.
  • Figure 5: A discontinuity of $K_{\bar{\epsilon}} (M)$ when the product form (assumption_space) is dropped. A set of Bernoulli reward models with deterministic transitions parameterized by $\theta \in \Theta \equiv [0, 1]^5$ (to the left). Arrows are choices of actions that deterministically lead to the pointed state and labels are mean rewards.
  • ...and 1 more figure

Theorems & Definitions (105)

  • Definition 1: Optimal, weakly optimal and sub-optimal pairs
  • Definition 2: Regret, auer_near_optimal_2009
  • Definition 3: Consistency
  • Definition 4: Invariant measures
  • Definition 5: Confusing set, boone_regret_2025
  • Theorem 1: boone_regret_2025
  • Proposition 2: Policy-wise formulation of $K(M)$
  • Definition 6: Component of pairs
  • Definition 7: Leveling transform
  • Definition 8: Leveled optimal pairs
  • ...and 95 more