Logarithmic regret bounds for continuous-time average-reward Markov decision processes

Xuefeng Gao; Xun Yu Zhou

TL;DR

This work studies learning in continuous-time Markov decision processes (CTMDPs) in the infinite-horizon, average-reward setting, where the transition probabilities and the rates of the exponential holding times are unknown. It establishes an instance-dependent regret lower bound that is logarithmic in the time horizon and designs the CT-UCRL algorithm, which combines a refined, truncation-based estimator of the mean holding times with optimism via extended value iteration to obtain a finite-time regret bound with the same logarithmic growth rate. The analysis relies on a change-of-measure argument for CTMDPs, control of the holding-time estimation error via truncation, and a conversion between elapsed time and the number of decision epochs through stochastic comparison of point processes. The results show that logarithmic regret growth is achievable in CTMDPs and point to extensions to larger or continuous state spaces and to diffusion-type settings.
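
To make the idea of a truncation-based holding-time estimator concrete, here is a minimal Python sketch of a generic truncated empirical mean. It is an illustration under stated assumptions, not necessarily the estimator used in the paper: the function name `truncated_mean`, the threshold `sqrt(u * i / log(1/delta))`, and the second-moment bound `u = 2 / lam_min**2` (valid for exponential holding times with rate at least a known `lam_min`) are all choices made here for illustration.

```python
import math
import random


def truncated_mean(samples, second_moment_bound, delta):
    """Truncated empirical mean of nonnegative samples (illustrative sketch).

    The i-th sample is kept only if it does not exceed a threshold that grows
    like sqrt(i); samples above the threshold contribute zero to the sum.
    The threshold form is an assumption made for this example.
    """
    log_term = math.log(1.0 / delta)
    total = 0.0
    for i, x in enumerate(samples, start=1):
        threshold = math.sqrt(second_moment_bound * i / log_term)
        if x <= threshold:
            total += x
    return total / len(samples)


# Hypothetical example: holding times with true rate 2 (mean 0.5),
# assuming only that the rate is known to be at least lam_min = 1.
rng = random.Random(0)
holding_times = [rng.expovariate(2.0) for _ in range(5000)]
lam_min = 1.0
estimate = truncated_mean(holding_times, 2.0 / lam_min**2, delta=0.05)
plain_mean = sum(holding_times) / len(holding_times)
```

Because the samples are nonnegative, the truncated estimate never exceeds the plain sample mean; the truncation trades a small downward bias for protection against unusually long holding times.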

Abstract

We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.
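
The dynamics described above can be sketched in a few lines of Python. This is a minimal simulation under assumptions that are not taken from the paper: rewards are modeled as a rate `rewards[s][a]` accrued while the process sits in state `s`, the policy is stationary and deterministic, and all names (`simulate_ctmdp`, `P`, `rates`, `rewards`, `policy`) are hypothetical.

```python
import random


def simulate_ctmdp(P, rates, rewards, policy, s0, horizon, seed=0):
    """Simulate a finite CTMDP under a fixed policy up to time `horizon`.

    P[s][a]       : list of next-state probabilities
    rates[s][a]   : rate of the exponential holding time in state s under a
    rewards[s][a] : reward rate accrued while holding (an assumed reward model)
    Returns (average reward per unit time, number of decision epochs).
    """
    rng = random.Random(seed)
    t, s = 0.0, s0
    total_reward, epochs = 0.0, 0
    while t < horizon:
        a = policy[s]
        # Exponential holding time, cut off at the end of the horizon.
        tau = min(rng.expovariate(rates[s][a]), horizon - t)
        total_reward += rewards[s][a] * tau
        t += tau
        epochs += 1
        # Jump to the next state according to the transition probabilities.
        probs = P[s][a]
        s = rng.choices(range(len(probs)), weights=probs)[0]
    return total_reward / horizon, epochs


# Hypothetical two-state, one-action example: the process alternates 0 -> 1 -> 0.
P = [[[0.0, 1.0]], [[1.0, 0.0]]]
rates = [[1.0], [2.0]]      # mean holding times 1.0 and 0.5
rewards = [[1.0], [0.0]]    # reward accrues only in state 0
policy = [0, 0]
avg_reward, n_epochs = simulate_ctmdp(P, rates, rewards, policy, 0, 20000.0)
```

In this example the long-run average reward equals the fraction of time spent in state 0, which is 1.0 / (1.0 + 0.5) = 2/3, so `avg_reward` should be close to that value. The number of decision epochs `n_epochs` is random for a fixed horizon, which is why regret can be measured either per decision epoch or per unit of elapsed time.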

Paper Structure (17 sections, 10 theorems, 92 equations, 1 algorithm)

This paper contains 17 sections, 10 theorems, 92 equations, 1 algorithm.

Introduction
Formulation of Learning in CTMDPs
Asymptotic Instance-Dependent Regret Lower Bound
Preliminaries
Main result
Proof of Theorem 3.1
The CT-UCRL Algorithm and Its Instance-Dependent Regret Upper Bound
The CT-UCRL algorithm
Refined estimator for mean holding time
The CT-UCRL algorithm
Regret upper bound
Proof of Theorem \ref{thm:UCRL}
Failing confidence regions
Bounding the number of suboptimal decision steps of CT-UCRL
Proof of Theorem \ref{thm:UCRL}
...and 2 more sections

Key Result

Theorem 3.1

For any learning algorithm $\mathcal{G}$ that is UF and any CTMDP $\mathcal{M} \in \mathcal{H}$, the expected regret up to the $N$-th decision epoch satisfies [...], where the instance-dependent constant $C(\mathcal{M})$ is given in Eq. (eq:CM). Moreover [...]

Theorems & Definitions (25)

Definition 2.3
Theorem 3.1: Logarithmic instance-dependent regret lower bound
Remark 3.2
Proposition 3.3
Proof 1
Proposition 3.4
Lemma 3.5
Proof 2
Lemma 3.6: Proposition 2 of burnetas1997optimal
Lemma 3.7
...and 15 more
