Reinforcement Learning and Regret Bounds for Admission Control

Lucas Weber; Ana Bušić; Jiamin Zhu

TL;DR

This work tackles regret minimization for multi-class admission control in $M/M/c/S$ queues with unknown arrival rates. By embedding the problem in a structured CTMDP and developing UCRL-AC, an optimistic planning scheme, the authors derive finite-time regret bounds that do not depend on the diameter and, in the infinite-server limit, remove dependence on the buffer size. They exploit gain-bias structure via Policy Iteration (and Value Iteration) to obtain bias-optimal policies and provide closed-form, efficient bias computations. Empirical results show improved regret performance across regimes, validating the theoretical guarantees and demonstrating practical viability for QoS-differentiated admission control.

Abstract

The expected regret of any reinforcement learning algorithm is lower bounded by $\Omega\left(\sqrt{DXAT}\right)$ for undiscounted returns, where $D$ is the diameter of the Markov decision process, $X$ the size of the state space, $A$ the size of the action space and $T$ the number of time steps. However, this lower bound is general: a smaller regret can be obtained by exploiting specific knowledge of the problem structure. In this article, we consider an admission control problem for an $M/M/c/S$ queue with $m$ job classes and class-dependent rewards and holding costs. Queuing systems often have a diameter that is exponential in the buffer size $S$, making the previous lower bound prohibitive for any practical use. We propose an algorithm inspired by UCRL2, and use the structure of the problem to upper bound the expected total regret by $O(S\log T + \sqrt{mT \log T})$ in the finite server case. In the infinite server case, we prove that the dependence of the regret on $S$ disappears.
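To give a feel for the gap between the two rates quoted in the abstract, the following minimal sketch evaluates both expressions with all constants dropped. The choices $D = 2^S$, $X = S+1$ and $A = 2^m$ are illustrative assumptions made here, not values taken from the paper, and the function names are hypothetical.

```python
import math

def generic_rate(D, X, A, T):
    """sqrt(D * X * A * T): the generic rate, constants dropped."""
    return math.sqrt(D * X * A * T)

def structured_rate(S, m, T):
    """S * log(T) + sqrt(m * T * log(T)): the structured rate, constants dropped."""
    return S * math.log(T) + math.sqrt(m * T * math.log(T))

S, m, T = 20, 2, 10**6

# Illustrative assumptions only (not from the paper):
# diameter exponential in S, one state per queue length, one action per subset of classes.
D, X, A = 2.0 ** S, S + 1, 2 ** m

generic = generic_rate(D, X, A, T)
structured = structured_rate(S, m, T)
print(f"generic rate:    {generic:.3e}")
print(f"structured rate: {structured:.3e}")
```

Under these assumptions the generic rate grows exponentially with $S$, while the structured rate grows only linearly with $S$.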

Paper Structure (47 sections, 11 theorems, 77 equations, 2 figures, 3 algorithms)


Introduction
Related work
Contributions
Problem Formulation
Admission Control Problem
Birth-and-Death Process
Background
Bellman Equation
Policy and Value Iteration
UCRL for Admission Control
Step 1: Confidence Intervals
Step 2: Optimistic CTMDP and Optimal Policy
Step 3: Exploration
Computing the Bias
Regret Analysis
...and 32 more sections

Key Result

Proposition 4.2

At the start of episode $k\geq 2$, with probability at least $1-\delta_{k-1}$, the true global arrival rate lies in the confidence interval defined with $\varepsilon_{k,\Lambda}=4\frac{\Lambda_{\max}^2}{\Lambda_{\min}}\sqrt{\frac{2}{\nu_{k-1}}\log\frac{1}{\delta_{k-1}}}$ and $\hat{\Lambda}_k$ given by the arrival rate estimators equation. We denote by $\hat{p}_{k}^{(i)}$ the empirical estimator of the pr
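As a small numerical illustration of the quantity $\varepsilon_{k,\Lambda}$ in the proposition, the sketch below evaluates the formula directly. The function name and argument names are hypothetical; `nu_prev` and `delta_prev` stand for $\nu_{k-1}$ and $\delta_{k-1}$, whose definitions are not given in this excerpt, so the numbers passed for them are placeholders.

```python
import math

def arrival_rate_epsilon(lam_min, lam_max, nu_prev, delta_prev):
    """Evaluate 4 * lam_max^2 / lam_min * sqrt((2 / nu_prev) * log(1 / delta_prev))."""
    if not 0 < lam_min <= lam_max:
        raise ValueError("expected 0 < lam_min <= lam_max")
    if nu_prev <= 0 or not 0 < delta_prev < 1:
        raise ValueError("expected nu_prev > 0 and 0 < delta_prev < 1")
    scale = 4.0 * lam_max ** 2 / lam_min
    return scale * math.sqrt(2.0 / nu_prev * math.log(1.0 / delta_prev))

# Lambda_min = 1 and Lambda_max = 4 are the values quoted for UCRL-AC in Figure 1;
# nu_prev and delta_prev below are placeholder values chosen for illustration.
for nu_prev in (10.0, 100.0, 1000.0):
    eps = arrival_rate_epsilon(1.0, 4.0, nu_prev, delta_prev=0.05)
    print(f"nu = {nu_prev:7.1f}  epsilon = {eps:.3f}")
```

With the other arguments held fixed, the value shrinks proportionally to $1/\sqrt{\nu_{k-1}}$.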

Figures (2)

Figure 1: Comparison of UCRL-AC with UCRL2, PSRL, KL-UCRL and UCRL3 with buffers of size $20, 50$ and individual service rates equal to $0.3, 0.4$ and $0.5$. We consider $5$ servers, $2$ job classes with immediate rewards $R_1=20$ and $R_2=10$ and arrival rates $\lambda_1 = 1$ and $\lambda_2=1$ respectively, and holding cost $C(t)=0.1t$ for both classes. For UCRL-AC, we used $\Lambda_{\min}=1$ and $\Lambda_{\max}=4$.
Figure 2: Theoretical upper bounds of UCRL2, UCRL3 and UCRL-AC and empirical regrets of UCRL2, PSRL, KL-UCRL, UCRL3 and UCRL-AC with the setting of Fig. 1(a).
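To make the setting of Figure 1 more concrete, here is a minimal sketch of the birth-and-death view of an $M/M/c/S$ queue. It assumes a fixed policy that admits every job while there is room, and it assumes $S$ counts all jobs in the system (in service or waiting). Both are assumptions made here. Rewards and holding costs are ignored, and this is not the authors' algorithm, only the standard stationary distribution of a birth-and-death chain.

```python
def mmcs_stationary(lam, mu, c, S):
    """Stationary distribution of the number of jobs in an M/M/c/S queue.

    lam: total arrival rate, mu: per-server service rate,
    c: number of servers, S: maximum number of jobs in the system (assumed).
    """
    weights = [1.0]
    for n in range(1, S + 1):
        # Birth rate lam; death rate min(n, c) * mu when n jobs are present.
        weights.append(weights[-1] * lam / (min(n, c) * mu))
    total = sum(weights)
    return [w / total for w in weights]

# Parameters quoted in Figure 1: lambda_1 + lambda_2 = 2, 5 servers,
# service rate 0.3, buffer size 20.
pi = mmcs_stationary(lam=2.0, mu=0.3, c=5, S=20)
blocking_probability = pi[-1]  # probability that the system is full
print(f"P(system full) = {blocking_probability:.4f}")
```

With these parameters the total service capacity $c\mu = 1.5$ is below the total arrival rate $2$, so under an admit-all policy the probability mass concentrates near the full buffer.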

Theorems & Definitions (15)

Definition 3.1: Gain optimal policy
Definition 4.1: Truncated empirical mean (see gaoLogarithmicRegretBounds2023)
Proposition 4.2: Confidence intervals
Theorem 4.3
Proposition 5.1
Theorem 6.1
Theorem 6.2
Proposition 7.1
Definition 5.1: Bias optimal policy (see feinbergOptimalityTrunkReservation2011)
Theorem 5.2
...and 5 more
