Optimal Analysis for Bandit Learning in Matching Markets with Serial Dictatorship
Zilong Wang, Shuai Li
TL;DR
This work studies bandit learning in two-sided matching markets under serial dictatorship and introduces a multi-level successive selection algorithm that matches the known regret lower bound using a decentralized, hierarchical exploration scheme. The algorithm assigns a strict priority order among players and progressively eliminates arms for lower-ranked players, relying on elimination and UCB subroutines to bound the number of suboptimal pulls. Theoretical analysis yields a stable regret bound of $O\left( \frac{N\log(T)}{Δ^2} + \frac{K\log(T)}{Δ} \right)$, which matches the lower bound and closes the gap left by prior work. The results provide a decentralized strategy and show how a hierarchical structure can organize exploration in matching markets; future work aims to extend the analysis beyond serial dictatorship.
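To get a feel for how the two terms of this bound scale, the following minimal Python sketch evaluates them with all constants dropped and compares them with the earlier $O\left( \frac{K\log(T)}{Δ^2} \right)$ bound quoted in the abstract. The function name `leading_terms` and the market sizes used are hypothetical, chosen only for illustration; the numbers indicate orders of growth, not actual regret values.

```python
import math


def leading_terms(n_players: int, n_arms: int, gap: float, horizon: int) -> dict:
    """Evaluate the leading terms of both upper bounds, ignoring constants.

    All argument values are illustrative assumptions, not from the paper.
    """
    log_t = math.log(horizon)
    return {
        # N log(T) / gap^2 + K log(T) / gap  (bound stated above)
        "multi_level": n_players * log_t / gap**2 + n_arms * log_t / gap,
        # K log(T) / gap^2  (earlier ET-GS bound)
        "et_gs": n_arms * log_t / gap**2,
    }


# Hypothetical market: 5 players, 50 arms, minimum gap 0.1, horizon 10^6.
terms = leading_terms(n_players=5, n_arms=50, gap=0.1, horizon=10**6)
ratio = terms["et_gs"] / terms["multi_level"]
print(f"ET-GS term is about {ratio:.1f}x larger in this example")
```

The improvement grows when $K$ is much larger than $N$ and $Δ$ is small, since the $\frac{1}{Δ^2}$ factor then multiplies $N$ rather than $K$.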
Abstract
The problem of two-sided matching markets is well studied in computer science and economics, owing to its diverse applications across numerous domains. Since participants in online matching platforms are usually uncertain about their own preferences, an emerging line of research is dedicated to the online setting, in which the participants on one side (players) learn their unknown preferences through multiple rounds of interaction with the other side (arms). Sankararaman et al. provide an $Ω\left( \frac{N\log(T)}{Δ^2} + \frac{K\log(T)}{Δ} \right)$ regret lower bound for this problem under the serial dictatorship assumption, where $N$ is the number of players, $K (\geq N)$ is the number of arms, $Δ$ is the minimum reward gap across players and arms, and $T$ is the time horizon. Serial dictatorship assumes that all arms have the same preferences over players, which is common in practice when the participants on one side share a unified evaluation standard. Recently, Kong and Li proposed the ET-GS algorithm, which achieves an $O\left( \frac{K\log(T)}{Δ^2} \right)$ regret upper bound, the best upper bound attained so far. Nonetheless, a gap between the lower and upper bounds persists: the coefficient of the $\frac{\log(T)}{Δ^2}$ term is $N$ in the lower bound but $K$ in the upper bound. It has remained unclear whether the lower bound or the upper bound needs to be improved. In this paper, we propose a multi-level successive selection algorithm that obtains an $O\left( \frac{N\log(T)}{Δ^2} + \frac{K\log(T)}{Δ} \right)$ regret bound when the market satisfies serial dictatorship. To the best of our knowledge, we are the first to propose an algorithm that matches the lower bound in the problem of matching markets with bandits.
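As an illustration of the serial dictatorship assumption, the following minimal Python sketch computes the stable matching when preferences are already known: because every arm ranks players in the same order, players can simply choose in that order, each taking its most preferred arm that is still free. This is not the learning algorithm proposed in the paper; the function name `serial_dictatorship_matching`, the player and arm labels, and the preference lists are all hypothetical.

```python
from typing import Dict, List


def serial_dictatorship_matching(
    priority: List[str], prefs: Dict[str, List[str]]
) -> Dict[str, str]:
    """Return the stable matching when all arms rank players by `priority`.

    priority: players from highest to lowest rank (shared by every arm).
    prefs:    each player's arms, ordered from most to least preferred.
    """
    taken = set()
    matching: Dict[str, str] = {}
    for player in priority:
        # A higher-ranked player wins any conflict, so lower-ranked players
        # can only choose among arms that are still unclaimed.
        for arm in prefs[player]:
            if arm not in taken:
                matching[player] = arm
                taken.add(arm)
                break
    return matching


# Hypothetical market with N = 3 players and K = 4 arms.
priority = ["p1", "p2", "p3"]
prefs = {
    "p1": ["a1", "a2", "a3", "a4"],
    "p2": ["a1", "a3", "a2", "a4"],
    "p3": ["a1", "a3", "a4", "a2"],
}
print(serial_dictatorship_matching(priority, prefs))
# {'p1': 'a1', 'p2': 'a3', 'p3': 'a4'}
```

In the bandit setting the preference lists are unknown and must be estimated from noisy rewards, so a lower-ranked player's stable arm depends on what the higher-ranked players end up selecting.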
