Bandit Learning in Matching Markets with Interviews

Amirmahdi Mirfakhar; Xuchuang Wang; Mengfan Xu; Hedyeh Beyhaghi; Mohammad Hajiesmaili

TL;DR

The paper studies horizon-independent learning in two-sided matching markets where participants conduct a limited number of interviews that reveal partial preference information, and where firms may be uncertain about their own rankings. It introduces a deferral option that lets firms hedge against misrankings, and it designs centralized and decentralized algorithms that achieve time-independent regret under minimal firm-side feedback, with strong performance in structured markets. In the centralized setting, a centralized interview allocator (CIA) coordinates interviews and applies Gale–Shapley to the estimated preference lists, converging to the agent-optimal stable matching with expected regret $O(nm^2)$. In decentralized settings, the authors propose a strategic-rejection policy and two feedback models (vacancy-only and anonymous hiring changes) that achieve horizon-independent regret of $O(n^3m^2)$ to $O(n^4m^2)$ in structured markets, with larger bounds in general markets and further improvements under a 3-interview extension. The work advances learning-to-stability by incorporating interviews, firm uncertainty, and minimal public signals, offering practical insights for decentralized labor-market platforms and matching markets with limited information sharing.
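To make the Gale–Shapley step mentioned above concrete, here is a minimal Python sketch of agent-proposing deferred acceptance. It is not the paper's algorithm: it assumes the estimated preference lists are already given, and it models neither interviews, learning, nor firm deferral. The function name `gale_shapley` and the agent and firm labels are hypothetical and chosen only for illustration.

```python
from collections import deque


def gale_shapley(agent_prefs, firm_prefs):
    """Agent-proposing deferred acceptance (illustrative sketch).

    agent_prefs: dict mapping each agent to a list of firms, best first.
    firm_prefs:  dict mapping each firm to a list of agents, best first.
    Returns a dict mapping matched agents to firms; unmatched agents are absent.
    """
    # rank[f][a] is the position of agent a in firm f's list (lower is better).
    rank = {f: {a: i for i, a in enumerate(prefs)}
            for f, prefs in firm_prefs.items()}
    next_choice = {a: 0 for a in agent_prefs}  # index of the next firm to try
    held = {}                                  # firm -> tentatively held agent
    free = deque(agent_prefs)                  # agents still proposing

    while free:
        a = free.popleft()
        if next_choice[a] >= len(agent_prefs[a]):
            continue  # a has exhausted its list and stays unmatched
        f = agent_prefs[a][next_choice[a]]
        next_choice[a] += 1
        if a not in rank[f]:
            free.append(a)  # f does not list a, so a moves on
            continue
        current = held.get(f)
        if current is None:
            held[f] = a  # f was vacant and tentatively accepts a
        elif rank[f][a] < rank[f][current]:
            held[f] = a  # f prefers a and releases its current agent
            free.append(current)
        else:
            free.append(a)  # f rejects a

    return {a: f for f, a in held.items()}


# Hypothetical toy market: both agents prefer f1, and f1 prefers a2.
agent_prefs = {"a1": ["f1", "f2"], "a2": ["f1", "f2"]}
firm_prefs = {"f1": ["a2", "a1"], "f2": ["a1", "a2"]}
matching = gale_shapley(agent_prefs, firm_prefs)
```

In this toy market, `a2` displaces `a1` at `f1`, and `a1` then settles at `f2`. When agents propose, the resulting stable matching is the agent-optimal one, which is the matching the TL;DR names as the target of the centralized procedure.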

Abstract

Two-sided matching markets rely on preferences from both sides, yet it is often impractical to evaluate preferences. Participants, therefore, conduct a limited number of interviews, which provide early, noisy impressions and shape final decisions. We study bandit learning in matching markets with interviews, modeling interviews as \textit{low-cost hints} that reveal partial preference information to both sides. Our framework departs from existing work by allowing firm-side uncertainty: firms, like agents, may be unsure of their own preferences and can make early hiring mistakes by hiring less preferred agents. To handle this, we extend the firm's action space to allow \emph{strategic deferral} (choosing not to hire in a round), enabling recovery from suboptimal hires and supporting decentralized learning without coordination. We design novel algorithms for (i) a centralized setting with an omniscient interview allocator and (ii) decentralized settings with two types of firm-side feedback. Across all settings, our algorithms achieve time-independent regret, a substantial improvement over the $O(\log T)$ regret bounds known for learning stable matchings without interviews. Moreover, in mildly structured markets, decentralized performance matches the centralized counterpart up to polynomial factors in the number of agents and firms.

Paper Structure (62 sections, 24 theorems, 95 equations, 1 table, 7 algorithms)

Introduction
Model
Extended Action Space for Firms' Uncertainty
Algorithmic Paradigms and Preliminaries
Centralized Learning
Decentralized Learning
Strategic Firm's Rejection Policy
Coordinated Decentralized Algorithm with Only Vacancy $\mathcal{V}(t)$ as Feedback
Coordination-Free Decentralized Algorithm with Anonymous Hiring Changes $\mathcal{V}^+(t)$ as Feedback
Conclusion
General Notation, Lemmas, and Observations for the Regret Analysis
Top-$k$ Ground-Truth and Estimated Preferences
Rounds of Top-$k$ Alignment
Rejection Variables Update Pseudo-Codes (Algorithms \ref{alg:fdrr}, \ref{alg:drr}, \ref{alg:ancdrr}, and \ref{alg:Eancdrr})
Deferred Concepts from the Model (Section \ref{sec:model}) and Preliminaries (Section \ref{sec:preliminaries})
...and 47 more sections

Key Result

Theorem 4.1

In a matching market $\mathcal{M}(\mathcal{A},\mathcal{F})$ with non-strategic firms, the expected optimal regret of agent $a$ under Algorithm \ref{alg:ciarr} is $\mathbb{E}[\overline{R}_a(\mathcal{T})]\in O(nm^2).$

Theorems & Definitions (62)

Definition 2.1
Example 2.2: Why abstention is necessary
Theorem 4.1
Definition 4.2: Valid Preference Lists
Lemma 4.2
Theorem 5.1
Theorem 5.2
Theorem 5.3
Theorem 5.4
Definition A.1: Top-$k$ agents and firms
...and 52 more
