Reinforcement Learning and Regret Bounds for Admission Control

Lucas Weber, Ana Bušić, Jiamin Zhu

TL;DR

This work tackles regret minimization for multi-class admission control in $M/M/c/S$ queues with unknown arrival rates. By embedding the problem in a structured CTMDP and developing UCRL-AC, an optimistic planning scheme, the authors derive finite-time regret bounds that do not depend on the diameter and, in the infinite-server limit, remove dependence on the buffer size. They exploit gain-bias structure via Policy Iteration (and VI) to obtain bias-optimal policies and provide closed-form, efficient bias computations. Empirical results show improved regret performance across regimes, validating the theoretical guarantees and demonstrating practical viability for QoS-differentiated admission control.

Abstract

The expected regret of any reinforcement learning algorithm is lower bounded by $\Omega\left(\sqrt{DXAT}\right)$ for undiscounted returns, where $D$ is the diameter of the Markov decision process, $X$ the size of the state space, $A$ the size of the action space and $T$ the number of time steps. However, this lower bound is general. A smaller regret can be obtained by taking into account some specific knowledge of the problem structure. In this article, we consider an admission control problem to an $M/M/c/S$ queue with $m$ job classes and class-dependent rewards and holding costs. Queuing systems often have a diameter that is exponential in the buffer size $S$, making the previous lower bound prohibitive for any practical use. We propose an algorithm inspired by UCRL2, and use the structure of the problem to upper bound the expected total regret by $O(S\log T + \sqrt{mT \log T})$ in the finite server case. In the infinite server case, we prove that the dependence of the regret on $S$ disappears.
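
To get a feel for the gap between the two rates quoted in the abstract, the following minimal sketch evaluates both expressions with all hidden constants dropped. The values of `S`, `m` and `T`, the choice `D = 2**S` for an "exponential" diameter, and the state and action counts `X = S + 1` and `A = 2**m` are illustrative assumptions, not quantities taken from the paper.

```python
import math


def general_rate(D: float, X: int, A: int, T: int) -> float:
    """Order of the generic bound sqrt(D * X * A * T), constants omitted."""
    return math.sqrt(D * X * A * T)


def structured_rate(S: int, m: int, T: int) -> float:
    """Order of S * log(T) + sqrt(m * T * log(T)), constants omitted."""
    return S * math.log(T) + math.sqrt(m * T * math.log(T))


# Hypothetical illustration values (assumptions, not from the paper).
S, m, T = 20, 2, 10**6
D = 2.0 ** S   # assumed: diameter growing exponentially in the buffer size
X = S + 1      # assumed: one state per number of jobs in the system
A = 2 ** m     # assumed: one admit/reject decision per job class

generic = general_rate(D, X, A, T)
structured = structured_rate(S, m, T)
print(f"generic rate:    {generic:.3e}")
print(f"structured rate: {structured:.3e}")
print(f"ratio:           {generic / structured:.1f}")
```

Because the constants are omitted, only the orders of magnitude are meaningful here; the point is that the diameter term dominates the generic rate as `S` grows, while the structured rate grows only linearly in `S`.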

Paper Structure (47 sections, 11 theorems, 77 equations, 2 figures, 3 algorithms)

Key Result

Proposition 4.2

At the start of episode $k\geq 2$, with probability at least $1-\delta_{k-1}$, the true global arrival rate lies in the confidence interval defined by with $\varepsilon_{k,\Lambda}=4\frac{\Lambda_{\max}^2}{\Lambda_{\min}}\sqrt{\frac{2}{\nu_{k-1}}\log\frac{1}{\delta_{k-1}}}$, and $\hat{\Lambda}_k$ defined by eq: arrival rate estimators. We note $\hat{p}_{k}^{(i)}$ the empirical estimator of the pr
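
The confidence radius $\varepsilon_{k,\Lambda}$ stated above can be evaluated directly. The sketch below is a plain transcription of that formula; the function name `confidence_radius` and the example values of `nu` and `delta` are assumptions for illustration, and `nu` is treated as a generic positive count because its precise definition is not given in this excerpt. The values $\Lambda_{\min}=1$ and $\Lambda_{\max}=4$ are those reported for UCRL-AC in Figure 1.

```python
import math


def confidence_radius(lam_min: float, lam_max: float, nu: float, delta: float) -> float:
    """Evaluate 4 * lam_max^2 / lam_min * sqrt((2 / nu) * log(1 / delta)).

    lam_min, lam_max: assumed known bounds on the global arrival rate.
    nu: positive count playing the role of nu_{k-1} (definition assumed).
    delta: failure probability delta_{k-1}, strictly between 0 and 1.
    """
    if not (0 < lam_min <= lam_max):
        raise ValueError("need 0 < lam_min <= lam_max")
    if nu <= 0:
        raise ValueError("nu must be positive")
    if not (0 < delta < 1):
        raise ValueError("delta must lie in (0, 1)")
    return 4.0 * lam_max**2 / lam_min * math.sqrt(2.0 / nu * math.log(1.0 / delta))


# Example with the bounds used for UCRL-AC in Figure 1; nu and delta are hypothetical.
for nu in (100, 400, 1600):
    eps = confidence_radius(lam_min=1.0, lam_max=4.0, nu=nu, delta=0.01)
    print(f"nu={nu:5d}  radius={eps:.3f}")
```

The radius shrinks like $1/\sqrt{\nu_{k-1}}$, so quadrupling the count halves the interval width, while the prefactor $\Lambda_{\max}^2/\Lambda_{\min}$ shows how loose prior bounds on the arrival rate widen it.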

Figures (2)

  • Figure 1: Comparison of UCRL-AC with UCRL2, PSRL, KL-UCRL and UCRL3 with buffers of size $20, 50$ and individual service rates equal to $0.3, 0.4$ and $0.5$. We consider $5$ servers, $2$ job classes with immediate rewards $R_1=20$ and $R_2=10$ and arrival rates $\lambda_1 = 1$ and $\lambda_2=1$ respectively, and holding cost $C(t)=0.1t$ for both classes. For UCRL-AC, we used $\Lambda_{\min}=1$ and $\Lambda_{\max}=4$.
  • Figure 2: Theoretical upper bounds of UCRL2, UCRL3 and UCRL-AC and empirical regrets of UCRL2, PSRL, KL-UCRL, UCRL3 and UCRL-AC with the setting of Fig. 1(a).

Theorems & Definitions (15)

  • Definition 3.1: Gain optimal policy
  • Definition 4.1: Truncated empirical mean (see gaoLogarithmicRegretBounds2023)
  • Proposition 4.2: Confidence intervals
  • Theorem 4.3
  • Proposition 5.1
  • Theorem 6.1
  • Theorem 6.2
  • Proposition 7.1
  • Definition 5.1: Bias optimal policy (see feinbergOptimalityTrunkReservation2011)
  • Theorem 5.2
  • ...and 5 more