Logarithmic regret bounds for continuous-time average-reward Markov decision processes

Xuefeng Gao, Xun Yu Zhou

TL;DR

This work addresses learning in continuous-time infinite-horizon average-reward MDPs (CTMDPs) with unknown transition rates and holding times. It establishes an instance-dependent logarithmic regret lower bound and designs the CT-UCRL algorithm, which uses refined holding-time estimators and an extended-value-iteration-based optimism procedure to achieve matching finite-time guarantees. The analysis hinges on a change-of-measure argument for CTMDPs, precise control of holding-time estimation via truncation, and a time-to-decision conversion through stochastic comparisons of point processes. The results demonstrate that logarithmic regret growth is achievable in CTMDPs, offering theoretically grounded performance guarantees and highlighting avenues for extending to larger or continuous state spaces and diffusion-type settings.

Abstract

We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.
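To make the holding-time estimation step concrete, here is a minimal sketch of a generic truncated empirical mean applied to simulated exponential holding times. It is an illustration under stated assumptions, not the estimator defined in the paper: the function names `simulate_holding_times` and `truncated_mean`, the fixed `threshold`, and the chosen rate are all hypothetical.

```python
import random


def simulate_holding_times(rate, n, seed=0):
    """Draw n i.i.d. exponential holding times with the given rate (mean 1/rate)."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(n)]


def truncated_mean(samples, threshold):
    """Empirical mean in which any sample above `threshold` is replaced by 0.

    This is a generic truncation device for controlling heavy upper tails;
    the fixed threshold is an assumption made for illustration only.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return sum(x if x <= threshold else 0.0 for x in samples) / len(samples)


if __name__ == "__main__":
    # Hypothetical state-action pair whose holding-time rate is 2.0 (mean 0.5).
    samples = simulate_holding_times(rate=2.0, n=10_000, seed=1)
    print(truncated_mean(samples, threshold=10.0))
```

With a threshold far above the typical holding time, the truncation rarely triggers and the estimate stays close to the true mean 1/rate; lowering the threshold trades a small downward bias for tighter tail control.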

Paper Structure

This paper contains 17 sections, 10 theorems, 92 equations, 1 algorithm.

Key Result

Theorem 3.1

For any learning algorithm $\mathcal{G}$ that is UF and any CTMDP $\mathcal{M} \in \mathcal{H}$, the expected regret up to the $N$-th decision epoch satisfies a logarithmic lower bound, where the instance-dependent constant $C(\mathcal{M})$ is given in eq:CM. Moreover

Theorems & Definitions (25)

  • Definition 2.3
  • Theorem 3.1: Logarithmic instance-dependent regret lower bound
  • Remark 3.2
  • Proposition 3.3
  • Proof 1
  • Proposition 3.4
  • Lemma 3.5
  • Proof 2
  • Lemma 3.6: Proposition 2 of Burnetas and Katehakis (1997)
  • Lemma 3.7
  • ...and 15 more