Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space

Saghar Adler, Vijay Subramanian

TL;DR

The paper tackles learning optimal policies for countable-state MDPs with unknown parameters in a Bayesian setting. It introduces Thompson sampling with dynamic episodes (TSDE) and leverages geometric and polynomial ergodicity to establish a finite-time Bayesian regret bound of $\tilde{O}(d h^d \sqrt{|\mathcal A| T})$. The approach is validated on two queueing models with unknown dynamics, demonstrating sublinear regret and practical applicability, including an $\epsilon$-optimal policy variant when an exact optimal policy oracle is unavailable. The work advances reinforcement learning for unbounded, countable state spaces by linking posterior sampling, ergodic stability, and average-cost optimality, with implications for networked and stochastic systems control.

Abstract

Models of many real-life applications, such as queuing models of communication networks or computing systems, have a countably infinite state-space. Algorithmic and learning procedures that have been developed to produce optimal policies mainly focus on finite state settings, and do not directly apply to these models. To overcome this lacuna, in this work we study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes (MDPs) governed by an unknown parameter $\theta\in\Theta$, and defined on a countably-infinite state space $\mathcal X=\mathbb{Z}_+^d$, with finite action space $\mathcal A$, and an unbounded cost function. We take a Bayesian perspective with the random unknown parameter $\boldsymbol{\theta}^*$ generated via a given fixed prior distribution on $\Theta$. To optimally control the unknown MDP, we propose an algorithm based on Thompson sampling with dynamically-sized episodes: at the beginning of each episode, the posterior distribution formed via Bayes' rule is used to produce a parameter estimate, which then decides the policy applied during the episode. To ensure the stability of the Markov chain obtained by following the policy chosen for each parameter, we impose ergodicity assumptions. From this condition and using the solution of the average cost Bellman equation, we establish an $\tilde O(dh^d\sqrt{|\mathcal A|T})$ upper bound on the Bayesian regret of our algorithm, where $T$ is the time-horizon. Finally, to elucidate the applicability of our algorithm, we consider two different queuing models with unknown dynamics, and show that our algorithm can be applied to develop approximately optimal control algorithms.
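The episode structure described in the abstract (sample a parameter from the posterior at the start of an episode, then follow the policy for that parameter until the episode ends) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy single-queue model with two servers, the finite parameter grid, the stand-in policy oracle that simply picks the server that looks faster, and the stopping rule (an episode may be at most one step longer than the previous one, and ends early once a visit count more than doubles). The function name `tsde_sketch` and all of its parameters are hypothetical.

```python
import random

def tsde_sketch(true_theta, grid, horizon, lam=0.3, mu0=0.5, seed=0):
    """Posterior sampling with dynamically sized episodes on a toy queue.

    Assumed model (not from the paper): Bernoulli(lam) arrivals, server 0
    with known service probability mu0, server 1 with unknown service
    probability true_theta taken from the finite set `grid`.
    """
    rng = random.Random(seed)
    posterior = [1.0 / len(grid)] * len(grid)  # uniform prior on the grid
    visits = {}                                # visit counts per (state, action)
    x, t, prev_len, total_cost, episodes = 0, 0, 0, 0, 0
    while t < horizon:
        # Episode start: draw a parameter from the current posterior.
        theta_k = rng.choices(grid, weights=posterior)[0]
        # Stand-in policy oracle (assumption): use the server that looks faster.
        a = 1 if theta_k > mu0 else 0
        counts_at_start = dict(visits)
        ep_len = 0
        episodes += 1
        while t < horizon:
            mu = true_theta if a == 1 else mu0
            arrival = int(rng.random() < lam)
            served = int(x > 0 and rng.random() < mu)
            if a == 1 and x > 0:
                # Bayes' rule on the grid: only server-1 attempts are informative.
                likes = [th if served else 1.0 - th for th in grid]
                weights = [p * l for p, l in zip(posterior, likes)]
                z = sum(weights)
                posterior = [w / z for w in weights]
            key = (x, a)
            visits[key] = visits.get(key, 0) + 1
            total_cost += x            # cost is the queue length (unbounded)
            x += arrival - served
            t += 1
            ep_len += 1
            # Assumed stopping rule: length grows by at most one per episode,
            # or a visit count has more than doubled since the episode began.
            doubled = visits[key] > 2 * max(1, counts_at_start.get(key, 0))
            if ep_len > prev_len or doubled:
                break
        prev_len = ep_len
    return posterior, total_cost, episodes
```

A call such as `tsde_sketch(0.8, [0.2, 0.4, 0.6, 0.8], 5000)` returns the final posterior over the grid, the accumulated cost, and the number of episodes. The sketch only shows the sampling loop; it does not solve the average cost Bellman equation, which the paper relies on to obtain the policy for each sampled parameter.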

Paper Structure

This paper contains 42 sections, 20 theorems, 247 equations, 7 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

For $p \in \mathbb{N}$, the $p$-th moments of $\max_{1\leq i \leq T} \tau_{0^d}^{(i)}$ and of $M^T_{\boldsymbol\theta^*}$, the maximum $\ell_{\infty}$-norm of the state vector attained up to and including time $T$, are both $O(\log^p T)$.
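To give some intuition for the logarithmic rate, the following worked bound shows how a maximum of $T$ nonnegative random variables with exponentially decaying tails has $p$-th moment $O(\log^p T)$. This is only a sketch under an assumed tail condition, not the paper's proof: the variables $Y_i$ and the constants $C$ and $\gamma$ are hypothetical stand-ins for the kind of tail control that geometric ergodicity typically provides. The union bound used below does not require independence.

```latex
% Assumption: Y_1, \dots, Y_T \ge 0 with P(Y_i > x) \le C e^{-\gamma x} for all x \ge 0.
% Let M = \max_{1 \le i \le T} Y_i and a = (\log T)/\gamma, so that T e^{-\gamma a} = 1.
\begin{align*}
\mathbb{E}[M^p]
  &= \int_0^\infty p\,x^{p-1}\,\mathbb{P}(M > x)\,dx \\
  &\le \int_0^a p\,x^{p-1}\,dx
     + \int_a^\infty p\,x^{p-1}\,T\,C\,e^{-\gamma x}\,dx
     && \text{(union bound for } x \ge a\text{)} \\
  &= a^p + C \int_0^\infty p\,(a+u)^{p-1}\,e^{-\gamma u}\,du
     && \text{(substitute } x = a + u\text{)} \\
  &= \left(\frac{\log T}{\gamma}\right)^{p} + O\!\left(\log^{p-1} T\right)
   = O\!\left(\log^{p} T\right).
\end{align*}
```

The last integral is a polynomial of degree $p-1$ in $a$, which is why the leading term is $a^p = (\log T)^p/\gamma^p$.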

Figures (7)

  • Figure 1: MDP evolution in episode $k<K_T$.
  • Figure 2: Two-server queueing systems with heterogeneous service rates.
  • Figure 3: Regret performance for $\lambda=0.3,0.5,0.7$. The shaded region shows the $\pm \sigma$ band around the mean regret.
  • Figure 4: Regret of Algorithm 2 (labeled TSDE) compared with the algorithm of Agrawal and Teneketzis [agrawal1989certainty] (labeled AgrawalTeneketzis) and the algorithm of Kumar and Becker [kumar1982new] (labeled RBMLE) for the queueing models of Figure 3.2.
  • Figure 5: Total variation distance between the posterior and real distribution for $\lambda=0.3,0.5,0.7$. The y axis is plotted on a logarithmic scale to display the differences clearly.
  • ...and 2 more figures

Theorems & Definitions (46)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Remark 5
  • Remark 6
  • Remark 7
  • ...and 36 more