Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

Guy Zamir; Matthew Zurek; Yudong Chen

Abstract

Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high ``burn-in'' costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $\gamma$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\tilde{O}( \sqrt{SA\,\text{Var}} + \text{lower-order terms})$, where $S,A$ are the state and action space sizes, and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $\gamma$-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span $\Vert h^\star\Vert_\text{sp}$, our algorithm obtains lower-order terms scaling as $\Vert h^\star\Vert_\text{sp} S^2 A$, which we prove is optimal in both $\Vert h^\star\Vert_\text{sp}$ and $A$. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than $\Vert h^\star \Vert_\text{sp}^2 S A$, and we provide a prior-free algorithm whose lower-order terms scale as $\Vert h^\star\Vert_\text{sp}^2 S^3 A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\Vert h^\star\Vert_\text{sp}$ in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.
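To make the remark about deterministic MDPs concrete, the following is a minimal sketch of a one-step transition variance. It rests on an assumption: the paper's exact variance parameter (Definition 3.1) is not reproduced in this summary, so the code uses the generic quantity $\mathrm{Var}_{s' \sim P(\cdot \mid s,a)}[V(s')]$ as a stand-in. The helper `transition_variance` and the toy value vector are hypothetical and chosen only for illustration.

```python
def transition_variance(p_row, values):
    """Variance of values[s'] when s' is drawn from the distribution p_row.

    Illustrative stand-in only; the paper's own variance parameter may differ.
    """
    mean = sum(p * v for p, v in zip(p_row, values))
    return sum(p * (v - mean) ** 2 for p, v in zip(p_row, values))


# Hypothetical value function over three states.
values = [0.0, 1.0, 3.0]

# A deterministic transition puts all mass on one next state.
deterministic_row = [0.0, 1.0, 0.0]
# A stochastic transition spreads mass over several next states.
uniform_row = [1 / 3, 1 / 3, 1 / 3]

det_var = transition_variance(deterministic_row, values)
uni_var = transition_variance(uniform_row, values)

print("deterministic:", det_var)
print("uniform:", uni_var)
```

Under this stand-in definition, every deterministic transition contributes zero variance, so a sum of such terms over time vanishes and a bound of the form $\tilde{O}(\sqrt{SA\,\text{Var}} + \text{lower-order terms})$ is left with only its lower-order terms.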

Paper Structure (46 sections, 26 theorems, 136 equations, 3 figures, 1 table)

Introduction
Contributions
Related Work
Online Average-Reward
$\gamma$-regret
Preliminaries
MDP Basics
Online RL and Regrets
Burn-In Cost
Additional Notation
Main Results
Algorithm
Computational Complexity
Variance Parameters
Main Results for Discounted Setting
...and 31 more sections

Key Result

Lemma 3.2

For $\delta\in(0,1)$, we have with probability $1-\delta$ that $\mathrm{Var}_\gamma^\star \leq O(\Vert V^\star_\gamma\Vert _{\textnormal{sp}}T + \Vert V^\star_\gamma\Vert _{\textnormal{sp}}^2\log(T/\delta)).$
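As a worked step, the following sketch shows how a bound of this shape converts a variance-dependent leading term into a worst-case one. It assumes that the $\text{Var}$ appearing in the $\gamma$-regret bound is $\mathrm{Var}_\gamma^\star$ (this summary does not state that explicitly) and writes $C$ for the absolute constant hidden in the $O(\cdot)$ of Lemma 3.2.

```latex
% Assumption: Var in the gamma-regret bound is Var_gamma^star, and
% C > 0 is the absolute constant hidden by O(.) in Lemma 3.2.
\begin{align*}
\sqrt{SA\,\mathrm{Var}_\gamma^\star}
  &\leq \sqrt{C\,SA\left(\Vert V^\star_\gamma\Vert_{\textnormal{sp}}\,T
        + \Vert V^\star_\gamma\Vert_{\textnormal{sp}}^2\log(T/\delta)\right)} \\
  &\leq \sqrt{C\,SA\,\Vert V^\star_\gamma\Vert_{\textnormal{sp}}\,T}
        + \Vert V^\star_\gamma\Vert_{\textnormal{sp}}\sqrt{C\,SA\log(T/\delta)},
\end{align*}
% The second line uses sqrt(a + b) <= sqrt(a) + sqrt(b) for a, b >= 0.
```

Under these assumptions, the leading term is at most of order $\sqrt{SA\,\Vert V^\star_\gamma\Vert_{\textnormal{sp}}\,T}$ up to logarithmic factors, while the second summand does not grow polynomially in $T$ and can be absorbed into the lower-order terms.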

Figures (3)

Figure 1: An example of the MDPs used in the proof of Theorem \ref{thm:no_prior_knowledge_H2_lb}. Here each state-action pair is annotated with its reward. If the transition associated with a state-action pair is deterministic, it is denoted with a solid arrow. If it is stochastic, it is represented as a solid line splitting into multiple dashed arrows to different states, each annotated with the associated probability of that transition. The MDPs are parameterized by $B>2$, both have starting state $1$, and differ only in the transition distribution of the stay action of state $2$. In $P_1$ an optimal stationary policy traverses to state $2$ and stays there, while in $P_2$ an optimal stationary policy remains in state $1$.
Figure 2: An example of a hard MDP construction for $S=14$ and $A=3$. To avoid clutter, we omit an additional deterministic self-loop at each leaf state. We also omit the deterministic actions which transit from the leaf states to the root and from the good state to the root, as these actions only serve to keep the diameter bounded by $D$.
Figure 3: An example of the MDPs used in the proof of Theorem \ref{thm:lower_bound_prior}. If the transition associated with a state-action pair is deterministic, it is denoted with a solid arrow. If it is stochastic, it is represented as a solid line splitting into multiple dashed arrows to different states, each annotated with the associated probability of that transition. The MDPs are parameterized by $B > 1$. Some actions, such as those which transit from leaf states back to the root state, are omitted.

Theorems & Definitions (40)

Definition 3.1
Lemma 3.2
Theorem 3.3: Variance-Dependent $\gamma$-Regret Bound
Lemma 3.4
Theorem 3.5: Variance-Dependent Regret Bound
Corollary 3.6: Regret Bound with Prior Knowledge
Corollary 3.7: Regret Bound without Prior Knowledge
Theorem 3.8: Burn-In Lower Bound For Prior-Free Algorithms
Theorem 3.9: General Burn-In Lower Bound
Lemma 4.1: Simplified Version of Theorem \ref{thm:lower_bound_prior}
...and 30 more
