Queueing Matching Bandits with Preference Feedback
Jung-hun Kim, Min-hwan Oh
TL;DR
This work tackles learning in dynamic queueing systems whose unknown, feature-based service rates follow a multinomial logit (MNL) model with preference feedback. It introduces two online policies, UCB-QMB and TS-QMB, that combine MaxWeight scheduling with confidence-based indices and Bayesian sampling, respectively, to stabilize queues while learning the service parameters. The authors prove finite-time stability with an average queue length of $\mathcal{Q}(T)=O(\min\{N,K\}/\epsilon)$ and derive sublinear regret bounds of the form $\tilde{O}(\min\{\sqrt{T} Q_{\max}, T^{3/4}\})$, with detailed dependence on the problem dimensions and the regularity parameter $\kappa$. Empirical results on synthetic data corroborate the theoretical findings, showing near-oracle stability and favorable regret behavior, which suggests practical relevance for applications such as ride-hailing and online labor markets.
Abstract
In this study, we consider multi-class multi-server asymmetric queueing systems consisting of $N$ queues on one side and $K$ servers on the other side, where jobs randomly arrive in queues at each time step. The service rate of each job-server assignment is unknown and modeled by a feature-based multinomial logit (MNL) function. At each time step, a scheduler assigns jobs to servers, and each server stochastically serves at most one job based on its preferences over the assigned jobs. The primary goal of the algorithm is to stabilize the queues in the system while learning the service rates of the servers. To achieve this goal, we propose algorithms based on UCB and Thompson Sampling, which achieve system stability with an average queue length bound of $O(\min\{N,K\}/\epsilon)$ for a large time horizon $T$, where $\epsilon$ is the traffic slackness of the system. Furthermore, the algorithms achieve sublinear regret bounds of $\tilde{O}(\min\{\sqrt{T} Q_{\max},T^{3/4}\})$, where $Q_{\max}$ represents the maximum queue length over agents and times. Lastly, we provide experimental results to demonstrate the performance of our algorithms.
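
To make the preference feedback concrete, the following is a minimal sketch of how a single server might pick at most one job from its assigned set under an MNL model. It assumes the standard MNL form with an outside ("serve nothing") option of utility 0, so that a job with feature vector $x$ has weight $\exp(x^\top \theta)$; this exact form is an assumption, since the abstract does not state it. The function names `mnl_service_probs` and `sample_served_job` and the parameter vector `theta` are hypothetical names chosen for illustration.

```python
import math
import random


def mnl_service_probs(features, theta):
    """Return MNL outcome probabilities for one server.

    features: list of feature vectors, one per job assigned to the server.
    theta: the server's parameter vector (unknown to the learner in the paper;
           treated as given here purely for illustration).
    Index 0 of the result is the assumed "serve nothing" outcome;
    index i + 1 corresponds to features[i].
    """
    utilities = [sum(x_i * t_i for x_i, t_i in zip(x, theta)) for x in features]
    # Subtract the largest utility (the outside option has utility 0)
    # so that exp() cannot overflow.
    shift = max([0.0] + utilities)
    null_weight = math.exp(-shift)
    weights = [math.exp(u - shift) for u in utilities]
    total = null_weight + sum(weights)
    return [null_weight / total] + [w / total for w in weights]


def sample_served_job(features, theta, rng=random):
    """Sample the index of the served job, or None if no job is served."""
    probs = mnl_service_probs(features, theta)
    outcome = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return None if outcome == 0 else outcome - 1
```

Under this assumed form, adding more jobs to a server's set raises the chance that some job is served but lowers each individual job's service probability, which is why the assignment decision interacts with queue stability.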
