Bandit Learning in Matching Markets with Interviews

Amirmahdi Mirfakhar, Xuchuang Wang, Mengfan Xu, Hedyeh Beyhaghi, Mohammad Hajiesmaili

TL;DR

The paper studies horizon-independent learning in two-sided matching markets where participants perform a limited number of interviews to reveal partial preferences, and firms may be uncertain about their own rankings. It introduces a deferral option that lets firms hedge against misrankings, and it designs centralized and decentralized algorithms that achieve time-independent regret under minimal firm-side feedback, with strong performance in structured markets. In the centralized setting, a centralized interview allocator (CIA) coordinates interviews and runs Gale–Shapley on the estimated preference lists, converging to the agent-optimal stable matching with expected regret $O(nm^2)$ per agent. In decentralized settings, the authors propose a strategic-rejection policy and two feedback models (vacancy-only and anonymous hiring changes) that achieve horizon-independent regret of $O(n^3m^2)$ to $O(n^4m^2)$ in structured markets and larger, but still horizon-independent, bounds in general markets, with further improvements under a 3-interview extension. The work advances learning-to-stability by incorporating interviews, firm uncertainty, and minimal public signals, offering practical insights for decentralized labor-market platforms and matching markets with limited information sharing.
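The Gale–Shapley step mentioned above can be illustrated with the textbook agent-proposing deferred-acceptance procedure. The sketch below is not the paper's CIA algorithm: it assumes the estimated preference lists are already available as plain Python dictionaries, and the function name `gale_shapley` and the sample market are hypothetical.

```python
from collections import deque


def gale_shapley(agent_prefs, firm_prefs):
    """Textbook agent-proposing deferred acceptance (illustrative only).

    agent_prefs: dict agent -> list of firms, most preferred first.
    firm_prefs:  dict firm -> list of agents, most preferred first.
    Returns a dict agent -> firm; unmatched agents are omitted.
    """
    # rank[f][a] is the position of agent a in firm f's list (lower is better).
    rank = {f: {a: i for i, a in enumerate(prefs)}
            for f, prefs in firm_prefs.items()}
    next_choice = {a: 0 for a in agent_prefs}  # next firm index to propose to
    held = {}                                  # firm -> agent tentatively held
    free = deque(agent_prefs)

    while free:
        a = free.popleft()
        if next_choice[a] >= len(agent_prefs[a]):
            continue  # a has exhausted its list and stays unmatched
        f = agent_prefs[a][next_choice[a]]
        next_choice[a] += 1
        firm_rank = rank.get(f, {})
        if a not in firm_rank:
            free.append(a)            # f does not list a; a proposes again later
        elif f not in held:
            held[f] = a               # f was vacant and tentatively accepts a
        elif firm_rank[a] < firm_rank[held[f]]:
            free.append(held[f])      # f trades up and releases its previous agent
            held[f] = a
        else:
            free.append(a)            # f keeps its current agent and rejects a

    return {a: f for f, a in held.items()}


# Hypothetical estimated lists for three agents and two firms.
agent_prefs = {"a1": ["f1", "f2"], "a2": ["f1", "f2"], "a3": ["f2", "f1"]}
firm_prefs = {"f1": ["a2", "a1", "a3"], "f2": ["a1", "a3", "a2"]}
matching = gale_shapley(agent_prefs, firm_prefs)
```

When the input lists are the true preferences, agent-proposing deferred acceptance returns the agent-optimal stable matching. With estimated lists, the output is only as good as the estimates, which is why the interview phase matters.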

Abstract

Two-sided matching markets rely on preferences from both sides, yet it is often impractical to evaluate preferences. Participants, therefore, conduct a limited number of interviews, which provide early, noisy impressions and shape final decisions. We study bandit learning in matching markets with interviews, modeling interviews as low-cost hints that reveal partial preference information to both sides. Our framework departs from existing work by allowing firm-side uncertainty: firms, like agents, may be unsure of their own preferences and can make early hiring mistakes by hiring less preferred agents. To handle this, we extend the firm's action space to allow strategic deferral (choosing not to hire in a round), enabling recovery from suboptimal hires and supporting decentralized learning without coordination. We design novel algorithms for (i) a centralized setting with an omniscient interview allocator and (ii) decentralized settings with two types of firm-side feedback. Across all settings, our algorithms achieve time-independent regret, a substantial improvement over the $O(\log T)$ regret bounds known for learning stable matchings without interviews. Also, under mild structured markets, decentralized performance matches the centralized counterpart up to polynomial factors in the number of agents and firms.
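To make the idea of strategic deferral concrete, here is a minimal sketch of one way a firm could decide between hiring and deferring. It is an assumption for illustration, not the rule used in the paper: the firm keeps an empirical mean in $[0, 1]$ for each applicant, builds Hoeffding confidence intervals, and hires only when its top applicant is separated from all others. The names `confidence_radius`, `hire_or_defer`, and the parameter `delta` are hypothetical.

```python
import math


def confidence_radius(num_samples, delta=0.05):
    """Hoeffding radius for observations in [0, 1]; infinite with no samples."""
    if num_samples == 0:
        return math.inf
    return math.sqrt(math.log(2.0 / delta) / (2.0 * num_samples))


def hire_or_defer(applicants, mean_estimate, num_samples, delta=0.05):
    """Return the applicant to hire, or None to defer this round.

    Hypothetical rule (not taken from the paper): hire only if the lower
    confidence bound of the empirically best applicant is strictly above
    the upper confidence bound of every other applicant.
    """
    if not applicants:
        return None  # nobody applied, so there is nothing to decide

    bounds = {}
    for a in applicants:
        r = confidence_radius(num_samples[a], delta)
        bounds[a] = (mean_estimate[a] - r, mean_estimate[a] + r)

    best = max(applicants, key=lambda a: mean_estimate[a])
    best_lower = bounds[best][0]
    for a in applicants:
        if a != best and bounds[a][1] >= best_lower:
            return None  # ranking is still ambiguous, so the firm defers
    # A lone applicant is hired here because there is no one to compare
    # against; this too is an assumption of the sketch.
    return best


# Hypothetical estimates for two applicants.
means = {"a1": 0.9, "a2": 0.2}
confident = hire_or_defer(["a1", "a2"], means, {"a1": 200, "a2": 200})
uncertain = hire_or_defer(["a1", "a2"], means, {"a1": 2, "a2": 2})
```

With 200 observations each, the intervals separate and the firm hires `a1`. With only 2 observations each, the intervals overlap and the firm defers rather than risk locking in a less preferred agent.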

Paper Structure

This paper contains 62 sections, 24 theorems, 95 equations, 1 table, and 7 algorithms.

Key Result

Theorem 4.1

In a matching market $\mathcal{M}(\mathcal{A},\mathcal{F})$ with non-strategic firms, the expected optimal regret of agent $a$ under the centralized algorithm (referenced in the source as alg:ciarr) satisfies $\mathbb{E}[\overline{R}_a(\mathcal{T})]\in O(nm^2)$.

Theorems & Definitions (62)

  • Definition 2.1
  • Example 2.2: Why abstention is necessary
  • Theorem 4.1
  • Definition 4.2: Valid Preference Lists
  • Lemma 4.2
  • Theorem 5.1
  • Theorem 5.2
  • Theorem 5.3
  • Theorem 5.4
  • Definition A.1: Top-$k$ agents and firms
  • ...and 52 more