Stochastic Matching Bandits with Rare Optimization Updates
Jung-hun Kim, Min-hwan Oh
TL;DR
This work studies stochastic matching bandits (SMB), in which multiple agents may be assigned to each arm and the arm probabilistically accepts one agent according to latent MNL preferences, with the goal of maximizing cumulative revenue. To avoid solving an NP-hard optimization problem in every round, it introduces a batched elimination-based learning algorithm (B-SMB) that updates assignments only a doubly logarithmic number of times while preserving a $\widetilde{\mathcal{O}}(\sqrt{T})$ regret bound. A parameter-free variant removes the need to know the nonlinearity parameter $\kappa$, maintaining the same regret rate and rare-update property. The approach combines SVD-based feature reduction, maximum likelihood estimation (MLE), UCB/LCB-guided exploration, and G-/D-optimal design to manage the exponentially large combinatorial action space, with experiments showing substantial computational savings and competitive performance in realistic settings.
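As a rough illustration of the acceptance step described above, the following sketch computes MNL choice probabilities for a single arm. The function name `mnl_accept_probs`, the scalar utilities, and the outside ("no match") option with utility 0 are assumptions made for this example; the summary does not specify the exact parameterization.

```python
import numpy as np


def mnl_accept_probs(utilities):
    """Return (per-agent acceptance probabilities, no-match probability).

    `utilities` holds the latent utility of each agent assigned to one arm.
    Assumption: an outside option with utility 0 represents "no match".
    """
    u = np.asarray(utilities, dtype=float)
    # Shift by the largest exponent (including the outside option's 0)
    # so that np.exp never overflows.
    shift = max(float(u.max()), 0.0) if u.size else 0.0
    weights = np.exp(u - shift)
    outside = np.exp(-shift)
    denom = outside + weights.sum()
    return weights / denom, outside / denom


if __name__ == "__main__":
    probs, no_match = mnl_accept_probs([1.0, 0.5, -0.2])
    print(probs, no_match)
```

Under this assumed form, adding more agents to an arm raises the chance that some match occurs but lowers each individual agent's acceptance probability, which is what makes the assignment problem combinatorial.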
Abstract
We introduce a bandit framework for stochastic matching under the multinomial logit (MNL) choice model. In our setting, $N$ agents on one side are assigned to $K$ arms on the other side, where each arm stochastically selects an agent from its assigned pool according to unknown preferences and yields a corresponding reward over a horizon $T$. The objective is to minimize regret by maximizing the cumulative revenue from successful matches. A naive approach requires solving an NP-hard combinatorial optimization problem at every round, resulting in a prohibitive computational cost. To address this challenge, we propose batched algorithms that strategically limit the number of times matching assignments are updated to $\Theta(\log\log T)$ over the entire horizon. By invoking expensive combinatorial optimization only on a vanishing fraction of rounds, our algorithms substantially reduce overall computational overhead while still achieving a regret bound of $\widetilde{\mathcal{O}}(\sqrt{T})$.
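To make the $\Theta(\log\log T)$ update count concrete, here is a minimal sketch of a batch grid with end points $t_m = T^{1 - 2^{-m}}$, a schedule commonly used in the batched bandit literature. It is an assumption for illustration only; the grid actually used by the proposed algorithms may differ, and `batch_end_points` is a hypothetical helper name.

```python
import math


def batch_end_points(T):
    """Return increasing batch end points for a horizon of T rounds.

    Assumption: the grid t_m = T^(1 - 2^-m), which needs only about
    log2(log2(T)) batches before the final one ending at T.
    """
    if T < 2:
        raise ValueError("T must be at least 2")
    num_batches = max(1, math.ceil(math.log2(math.log2(T))))
    ends = set()
    for m in range(1, num_batches + 1):
        # Round up and cap at T so every end point is a valid round index.
        ends.add(min(T, math.ceil(T ** (1 - 2 ** (-m)))))
    ends.add(T)  # the last batch always runs to the end of the horizon
    return sorted(ends)


if __name__ == "__main__":
    for horizon in (1_000, 1_000_000):
        print(horizon, batch_end_points(horizon))
```

In a scheme of this kind, the expensive assignment optimization would run once per batch, so the number of calls grows doubly logarithmically in $T$ rather than linearly.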
