Queueing Matching Bandits with Preference Feedback

Jung-hun Kim, Min-hwan Oh

TL;DR

This work tackles learning in dynamic queueing systems with unknown, feature-based service rates encoded by a multinomial logit model and preference feedback. It introduces two online policies, UCB-QMB and TS-QMB, that integrate MaxWeight scheduling with confidence-based indices and Bayesian sampling to stabilize queues while learning the service parameters. The authors prove finite-time stability with $\mathcal{Q}(T)=O(\min\{N,K\}/\epsilon)$ and derive sublinear regret bounds of the form $\tilde{O}(\min\{\sqrt{T} Q_{\max},T^{3/4}\})$, with detailed dependence on problem dimensions and regularity $\kappa$. Empirical results on synthetic data corroborate the theoretical findings, showing near-oracle stability and favorable regret behavior, indicating practical relevance for applications like ride-hailing and online labor markets.

Abstract

In this study, we consider multi-class multi-server asymmetric queueing systems consisting of $N$ queues on one side and $K$ servers on the other side, where jobs randomly arrive in queues at each time. The service rate of each job-server assignment is unknown and modeled by a feature-based Multinomial Logit (MNL) function. At each time, a scheduler assigns jobs to servers, and each server stochastically serves at most one job based on its preferences over the assigned jobs. The primary goal of the algorithm is to stabilize the queues in the system while learning the service rates of servers. To achieve this goal, we propose algorithms based on UCB and Thompson Sampling, which achieve system stability with an average queue length bound of $O(\min\{N,K\}/\epsilon)$ for a large time horizon $T$, where $\epsilon$ is the traffic slackness of the system. Furthermore, the algorithms achieve sublinear regret bounds of $\tilde{O}(\min\{\sqrt{T} Q_{\max},T^{3/4}\})$, where $Q_{\max}$ represents the maximum queue length over agents and times. Lastly, we provide experimental results to demonstrate the performance of our algorithms.
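
To make the service model concrete, the sketch below computes MNL service probabilities for a single server. It assumes the standard MNL form with a "no service" outside option, $\exp(x_n^\top\theta_k)/(1+\sum_{m\in S_k}\exp(x_m^\top\theta_k))$, where $x_n$ is a feature vector for queue $n$ and $S_k$ is the set of jobs assigned to server $k$. The exact parameterization is an assumption here, since the abstract does not spell it out, and the function and argument names are illustrative only.

```python
import numpy as np


def mnl_service_probs(features, theta, assigned):
    """Probability that one server serves each job assigned to it.

    Assumed model (not confirmed by the abstract): MNL with an outside
    "no service" option whose utility is fixed at 0.

    features : (N, d) array, one feature vector per queue (hypothetical layout)
    theta    : (d,) array, the server's preference parameter
    assigned : iterable of queue indices assigned to this server
    """
    assigned = list(assigned)
    if not assigned:
        return {}
    utilities = features[assigned] @ theta
    # Subtract a common shift before exponentiating for numerical stability;
    # the outside option contributes exp(0 - shift) to the denominator.
    shift = max(0.0, float(utilities.max()))
    weights = np.exp(utilities - shift)
    denom = np.exp(-shift) + weights.sum()
    return {n: float(w / denom) for n, w in zip(assigned, weights)}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 2))      # N = 4 queues, d = 2 features
    theta_k = rng.normal(size=2)     # parameter of one server
    probs = mnl_service_probs(x, theta_k, assigned=[0, 2])
    # The probabilities sum to less than 1: the remainder is the chance
    # that the server serves none of the assigned jobs in this round.
    print(probs, 1.0 - sum(probs.values()))
```

Under this assumed form, adding more jobs to a server's assignment lowers each job's individual service probability, which is why the scheduler's choice of assignment matters.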

Paper Structure

This paper contains 33 sections, 19 theorems, 148 equations, 3 figures, 1 table, 4 algorithms.

Key Result

Proposition 1

Given the prior knowledge of $\theta_k$ for all $k\in[K]$, the average queue length of MaxWeight is bounded as $\mathcal{Q}(T)=\mathcal{O}\left(\frac{\min\{N,K\}}{\epsilon}\right)$, which implies that the algorithm achieves stability.
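
As a rough illustration of the oracle in Proposition 1, the sketch below brute-forces a MaxWeight-style assignment when every $\theta_k$ is known. It assumes the objective is the queue-length-weighted sum of MNL service probabilities, that each queue is sent to at most one server, and that each server accepts at most `capacity` jobs. These modeling details and all names are assumptions for illustration, not the paper's stated algorithm, and exhaustive search is only practical for very small $N$ and $K$.

```python
from itertools import product

import numpy as np


def mnl_probs(x, theta):
    """MNL service probabilities with an outside option (assumed form)."""
    u = x @ theta
    shift = max(0.0, float(u.max()))
    w = np.exp(u - shift)
    return w / (np.exp(-shift) + w.sum())


def maxweight_bruteforce(queues, features, thetas, capacity):
    """Exhaustively search assignments maximizing sum_n Q_n * P(n is served).

    queues   : (N,) current queue lengths
    features : (N, d) queue features (hypothetical layout)
    thetas   : (K, d) known server parameters
    capacity : max number of jobs per server (assumed constraint)
    Returns (assignment, value); assignment[n] is a server index or None.
    """
    queues = np.asarray(queues, dtype=float)
    n_queues, n_servers = len(queues), len(thetas)
    options = [None] + list(range(n_servers))
    best_value, best_assignment = 0.0, (None,) * n_queues
    for assignment in product(options, repeat=n_queues):
        value, feasible = 0.0, True
        for k in range(n_servers):
            jobs = [n for n, a in enumerate(assignment) if a == k]
            if len(jobs) > capacity:
                feasible = False
                break
            if jobs:
                p = mnl_probs(features[jobs], thetas[k])
                value += float(np.dot(queues[jobs], p))
        if feasible and value > best_value:
            best_value, best_assignment = value, assignment
    return best_assignment, best_value


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N, K, d = 4, 3, 2
    assignment, value = maxweight_bruteforce(
        queues=[5, 1, 3, 0],
        features=rng.normal(size=(N, d)),
        thetas=rng.normal(size=(K, d)),
        capacity=2,
    )
    print(assignment, value)
```

Weighting each service probability by the queue length steers service toward long queues, which is the intuition behind the stability guarantee stated above.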

Figures (3)

  • Figure 1: Illustration of queueing process with 4 queues/agents ($N = 4$) and 3 servers/arms ($K = 3$)
  • Figure 2: Experimental results for (left) average queue length and (right) regret
  • Figure 3: Experimental results with $N=4, K=3, L=2, d=2$ for (left) average queue length and (right) regret

Theorems & Definitions (20)

  • Definition 1
  • Proposition 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Lemma 1: Lemma 9 in oh2021multinomial
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • ...and 10 more