Tail Distribution of Regret in Optimistic Reinforcement Learning

Sajad Khodadadian, Mehrdad Moharrami

TL;DR

This paper addresses how the regret of an optimistic, episodic RL algorithm concentrates around its mean in finite-horizon tabular MDPs with unknown dynamics. It introduces two bonus schedules, $K$-dependent (KD) and $K$-independent (KI), within a UCBVI framework and proves instance-dependent tail bounds with a two-regime structure: a sub-Gaussian tail from an instance-dependent scale $m_K$ up to a transition threshold, and a sub-Weibull tail beyond it, governed by a global $Q^*$-gap $\texttt{gap}^*$. It also provides an explicit bound on the expected regret $\mathbb{E}[R_K]$ and shows how a tuning parameter $\alpha\in[0,1]$ trades off expected regret against the range over which the tail stays sub-Gaussian. To the authors' knowledge, these are among the first comprehensive tail-regret guarantees for a standard optimistic algorithm in episodic reinforcement learning, giving a finer, risk-sensitive picture than high-probability or expectation-only results.
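The two-regime shape described above can be pictured with a small Python sketch. This is only an illustration of a generic "sub-Gaussian, then sub-Weibull" envelope for $\Pr(R_K \ge x)$, not the bound proved in the paper: the function name `two_regime_tail` and the parameters `sigma`, `threshold`, and `theta` are assumptions introduced here, and the paper's actual constants and exponents are not reproduced in this summary.

```python
import math


def _gaussian_exponent(y, m, sigma):
    """Exponent of a sub-Gaussian tail centred at the scale m."""
    return (y - m) ** 2 / (2.0 * sigma ** 2)


def two_regime_tail(x, m, sigma, threshold, theta):
    """Illustrative envelope for Pr(R >= x) with two regimes.

    Hypothetical parameters (not taken from the paper):
      m         -- scale below which the envelope is trivially 1
      sigma     -- width of the sub-Gaussian regime
      threshold -- point where the sub-Gaussian regime ends
      theta     -- sub-Weibull tail parameter (theta > 1/2 is heavier
                   than Gaussian; theta = 1 is sub-exponential)
    """
    if sigma <= 0 or theta <= 0 or threshold <= m:
        raise ValueError("need sigma > 0, theta > 0 and threshold > m")
    if x <= m:
        return 1.0
    if x <= threshold:
        # Sub-Gaussian regime: exp(-(x - m)^2 / (2 sigma^2)).
        return math.exp(-_gaussian_exponent(x, m, sigma))
    # Sub-Weibull regime: the exponent grows like (x - m)^(1/theta),
    # rescaled so the envelope is continuous at the threshold.
    ratio = (x - m) / (threshold - m)
    exponent_at_threshold = _gaussian_exponent(threshold, m, sigma)
    return math.exp(-exponent_at_threshold * ratio ** (1.0 / theta))


if __name__ == "__main__":
    for x in (5, 10, 20, 30, 60):
        print(x, two_regime_tail(x, m=10, sigma=5, threshold=30, theta=1.0))
```

With `theta` above 1/2, the envelope decays more slowly past `threshold` than the Gaussian curve would if it were continued, which is the qualitative point of the two-regime statement.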

Abstract

We derive instance-dependent tail bounds for the regret of optimism-based reinforcement learning in finite-horizon tabular Markov decision processes with unknown transition dynamics. Focusing on a UCBVI-type algorithm, we characterize the tail distribution of the cumulative regret $R_K$ over $K$ episodes, rather than only its expectation or a single high-probability quantile. We analyze two natural exploration-bonus schedules: (i) a $K$-dependent scheme that explicitly incorporates the total number of episodes $K$, and (ii) a $K$-independent scheme that depends only on the current episode index. For both settings, we obtain an upper bound on $\Pr(R_K \ge x)$ that exhibits a distinctive two-regime structure: a sub-Gaussian tail starting from an instance-dependent scale $m_K$ up to a transition threshold, followed by a sub-Weibull tail beyond that point. We further derive corresponding instance-dependent bounds on the expected regret $\mathbb{E}[R_K]$. The proposed algorithm depends on a tuning parameter $\alpha$, which balances the expected regret and the range over which the regret exhibits a sub-Gaussian tail. To the best of our knowledge, our results provide one of the first comprehensive tail-regret guarantees for a standard optimistic algorithm in episodic reinforcement learning.
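To make the difference between the two bonus schedules concrete, here is a minimal Python sketch. It assumes a generic Hoeffding-style bonus of the form `scale * H * sqrt(log_term / n)`, which is a common shape for UCBVI-type bonuses but is not claimed to be the paper's definition. The function names, the `scale` constant, and the exact logarithmic term are hypothetical, and the role of the tuning parameter $\alpha$ is left out because it is not specified in this summary.

```python
import math


def _hoeffding_style_bonus(horizon, visits, log_term, scale=1.0):
    """Generic bonus scale * H * sqrt(log_term / n); an assumed form."""
    return scale * horizon * math.sqrt(log_term / max(visits, 1))


def kd_bonus(horizon, visits, n_states, n_actions, total_episodes, scale=1.0):
    """K-dependent schedule: the log term uses the total episode count K."""
    log_term = math.log(n_states * n_actions * horizon * total_episodes)
    return _hoeffding_style_bonus(horizon, visits, log_term, scale)


def ki_bonus(horizon, visits, n_states, n_actions, episode, scale=1.0):
    """K-independent schedule: the log term uses only the current episode k."""
    log_term = math.log(n_states * n_actions * horizon * episode)
    return _hoeffding_style_bonus(horizon, visits, log_term, scale)


if __name__ == "__main__":
    H, S, A, K = 10, 5, 3, 1000  # hypothetical problem sizes
    for k in (1, 10, 100, 1000):
        print(
            k,
            kd_bonus(H, visits=k, n_states=S, n_actions=A, total_episodes=K),
            ki_bonus(H, visits=k, n_states=S, n_actions=A, episode=k),
        )
```

In this sketch the KI bonus is smaller than the KD bonus in early episodes and matches it at `episode == total_episodes`; the practical appeal of a KI schedule is that the learner does not need to know $K$ in advance.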

Paper Structure

This paper contains 8 sections, 5 theorems, 55 equations, 1 algorithm.

Key Result

Theorem 1

Fix $\alpha\in[0,1]$. For all $\gamma \in [0,1]$, the tail of the cumulative regret $R_K$ satisfies where and

Theorems & Definitions (10)

  • Theorem 1
  • Theorem 2
  • Corollary 1
  • proof : Proof of Corollary 1 (cor:tailExpBound-UCBVI)
  • proof : Proof of Theorem 1 (thm:tailGap-UCBVI)
  • proof : Proof of Theorem 2 (thm:expectGap-UCBVI)
  • Lemma 1
  • proof : Proof of Lemma 1 (lem:sumUpperBound-UCBVI)
  • Lemma 2
  • proof : Proof of Lemma 2 (lem:gapDepOptimality-UCBVI)