Optimal Analysis for Bandit Learning in Matching Markets with Serial Dictatorship

Zilong Wang, Shuai Li

TL;DR

This work studies bandit learning in two-sided matching markets under serial dictatorship. It introduces a multi-level successive selection algorithm whose regret matches the known lower bound of Sankararaman et al., using a decentralized, hierarchical exploration scheme. The algorithm assigns a strict priority order among players and progressively eliminates arms for lower-ranked players, relying on elimination and UCB subroutines to bound the number of suboptimal pulls. The analysis yields a stable regret bound of $O\left( \frac{N\log(T)}{\Delta^2} + \frac{K\log(T)}{\Delta} \right)$, which matches the lower bound and closes the gap left by prior work. The results provide a practical decentralized strategy and show how a hierarchical structure can organize exploration in matching markets; extending the analysis beyond serial dictatorship is left to future work.
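The role of the priority order can be illustrated in the full-information case. The sketch below assumes every player's preference list is already known, which the learning setting does not assume, and it is not the paper's algorithm. The function name `serial_dictatorship_matching` and the example preferences are hypothetical. Players pick in priority order, and each pick becomes unavailable to lower-ranked players.

```python
from typing import Dict, List


def serial_dictatorship_matching(
    player_order: List[str],
    player_prefs: Dict[str, List[str]],
) -> Dict[str, str]:
    """Match players to arms when every arm ranks players identically.

    player_order: players from highest to lowest priority.
    player_prefs: for each player, arms listed from most to least preferred.
    """
    taken = set()
    matching: Dict[str, str] = {}
    for player in player_order:
        # The highest-priority unmatched player takes its best remaining arm.
        for arm in player_prefs[player]:
            if arm not in taken:
                matching[player] = arm
                taken.add(arm)
                break
    return matching


# Hypothetical preferences: all three players like a2 best.
prefs = {
    "p1": ["a2", "a1", "a3"],
    "p2": ["a2", "a3", "a1"],
    "p3": ["a2", "a3", "a1"],
}
print(serial_dictatorship_matching(["p1", "p2", "p3"], prefs))
# {'p1': 'a2', 'p2': 'a3', 'p3': 'a1'}
```

In this example p1 takes a2, so p2 falls back to a3 and p3 to a1. In the learning setting the preference lists are unknown and must be estimated from rewards, which is where the exploration cost enters.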

Abstract

The problem of two-sided matching markets is well studied in computer science and economics, owing to its diverse applications across numerous domains. Since market participants are usually uncertain about their preferences in various online matching platforms, an emerging line of research is dedicated to the online setting where participants on one side (players) learn their unknown preferences through multiple rounds of interactions with the other side (arms). Sankararaman et al. provide an $\Omega\left( \frac{N\log(T)}{\Delta^2} + \frac{K\log(T)}{\Delta} \right)$ regret lower bound for this problem under the serial dictatorship assumption, where $N$ is the number of players, $K$ ($\geq N$) is the number of arms, $\Delta$ is the minimum reward gap across players and arms, and $T$ is the time horizon. Serial dictatorship assumes that all arms have the same preferences, which is common in practice when the participants on one side share a unified evaluation standard. Recently, Kong and Li proposed the ET-GS algorithm, which achieves an $O\left( \frac{K\log(T)}{\Delta^2} \right)$ regret upper bound, the best upper bound attained so far. Nonetheless, a gap between the lower and upper bounds, ranging from $N$ to $K$, persists. It remains unclear whether the lower bound or the upper bound needs to be improved. In this paper, we propose a multi-level successive selection algorithm that obtains an $O\left( \frac{N\log(T)}{\Delta^2} + \frac{K\log(T)}{\Delta} \right)$ regret bound when the market satisfies serial dictatorship. To the best of our knowledge, we are the first to propose an algorithm that matches the lower bound in the problem of matching markets with bandits.
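To get a feel for the gap between the two upper bounds, the following sketch evaluates both expressions with all hidden constants dropped. The function names and the example values of $N$, $K$, $\Delta$, and $T$ are illustrative assumptions, so the numbers indicate scaling only, not actual regret.

```python
import math


def new_bound(n_players: int, n_arms: int, gap: float, horizon: int) -> float:
    """N log(T) / gap^2 + K log(T) / gap, with constants dropped."""
    log_t = math.log(horizon)
    return n_players * log_t / gap**2 + n_arms * log_t / gap


def prior_bound(n_arms: int, gap: float, horizon: int) -> float:
    """K log(T) / gap^2, with constants dropped."""
    return n_arms * math.log(horizon) / gap**2


# Hypothetical market: few players, many arms, small gap.
N, K, gap, T = 3, 50, 0.1, 100_000
ratio = prior_bound(K, gap, T) / new_bound(N, K, gap, T)
print(f"prior / new = {ratio:.2f}")
```

With these values the $\log(T)$ factor cancels and the ratio is $(K/\Delta^2) / (N/\Delta^2 + K/\Delta) = 5000 / 800 = 6.25$. The improvement grows as $K$ becomes large relative to $N$ and as $\Delta$ shrinks.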

Paper Structure

This paper contains 16 sections, 9 theorems, 26 equations, 2 figures, 1 table, 6 algorithms.

Key Result

Theorem 1

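Following the multi-level successive selection algorithm (the paper's main algorithm) with the elimination subroutine, the stable regret of each player $p_i \in \mathcal{N}$ is bounded; as stated in the abstract, the bound is of order $O\left( \frac{N\log(T)}{\Delta^2} + \frac{K\log(T)}{\Delta} \right)$.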
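The idea behind an elimination subroutine can be sketched with a generic successive-elimination rule. This is a textbook rule, not the paper's subroutine. The Hoeffding-style radius with constant 2, the function names, and the example statistics are all assumptions. An arm is dropped once its upper confidence bound falls below the best lower confidence bound.

```python
import math
from typing import Dict, List


def confidence_radius(pulls: int, horizon: int) -> float:
    """Hoeffding-style radius; the constant 2 is an illustrative assumption."""
    return math.sqrt(2.0 * math.log(horizon) / pulls)


def surviving_arms(
    means: Dict[str, float], pulls: Dict[str, int], horizon: int
) -> List[str]:
    """Keep arms whose upper bound is not below the best lower bound."""
    lcb = {a: means[a] - confidence_radius(pulls[a], horizon) for a in means}
    ucb = {a: means[a] + confidence_radius(pulls[a], horizon) for a in means}
    best_lcb = max(lcb.values())
    return [a for a in means if ucb[a] >= best_lcb]


# Hypothetical empirical means after 2,000 pulls of each arm.
means = {"a1": 0.9, "a2": 0.5, "a3": 0.85}
pulls = {"a1": 2000, "a2": 2000, "a3": 2000}
print(surviving_arms(means, pulls, horizon=100_000))
# ['a1', 'a3']
```

Here the radius is about 0.107, so a2 (upper bound about 0.607) is eliminated against a1 (lower bound about 0.793), while a3 is still too close to a1 to rule out.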

Figures (2)

  • Figure 1: A case with $3$ players and $5$ arms, with rewards drawn i.i.d. uniformly from $[0,1]$, over $10$ independent runs with horizon $T = 100{,}000$. The heat map shows the number of times the player pulls each arm in each block of $1{,}000$ rounds. The color intensity increases with the number of times the arm is selected.
  • Figure 2: Multi-level blocking case.
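The per-block counts behind a heat map like Figure 1 can be tallied as in the sketch below. The function name `pulls_per_block` is hypothetical, and the uniformly random arm choices are placeholder data standing in for an actual run of a learning algorithm; they do not reproduce the paper's experiment.

```python
import random
from typing import List


def pulls_per_block(
    arm_choices: List[int], n_arms: int, block: int = 1000
) -> List[List[int]]:
    """Return counts[b][a]: how often arm a was pulled in block b of rounds."""
    n_blocks = len(arm_choices) // block
    counts = [[0] * n_arms for _ in range(n_blocks)]
    # Rounds beyond the last full block are ignored.
    for t, arm in enumerate(arm_choices[: n_blocks * block]):
        counts[t // block][arm] += 1
    return counts


random.seed(0)
# Placeholder: one player's arm index in each of 100,000 rounds.
choices = [random.randrange(5) for _ in range(100_000)]
counts = pulls_per_block(choices, n_arms=5)
print(len(counts), len(counts[0]))  # 100 5
```

Each row of `counts` sums to the block length, so plotting the matrix with one row per block and one column per arm gives the layout described in the caption.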

Theorems & Definitions (16)

  • Theorem 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • ...and 6 more