Bandit Learning in Matching Markets: Utilitarian and Rawlsian Perspectives

Hadi Hosseini; Duohan Zhang

TL;DR

The paper studies learning in two-sided matching markets where preferences are unknown. It introduces two welfare objectives, utilitarian welfare (the sum of all utilities) and Rawlsian maximin welfare (the utility of the worst-off participant), and develops epoch Explore-Then-Commit (ETC) algorithms that learn stable matchings under each objective. It shows a regret bound of $\tilde{O}(N^2\log(T))$ for the utilitarian objective and $\tilde{O}(N\log(T))$ for the maximin objective, with the analysis relying on within-side and cross-side preference gaps. The work combines classical stable matching techniques (deferred acceptance (DA), rotations, min-cut) with bandit learning, and evaluates both welfare and market stability in simulated experiments.
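To make the two objectives concrete, the sketch below evaluates both on a toy market with two agents and two arms. The utility values, the names `agent_utility` and `arm_utility`, and the choice to take the minimum over participants on both sides are illustrative assumptions, not taken from the paper.

```python
# Hypothetical toy market: utilities are made up for illustration only.
agent_utility = {"a1": {"b1": 9, "b2": 4}, "a2": {"b1": 3, "b2": 8}}
arm_utility = {"b1": {"a1": 2, "a2": 7}, "b2": {"a1": 6, "a2": 5}}


def utilitarian_welfare(matching, agent_utility, arm_utility):
    """Sum of utilities over every matched participant on both sides."""
    return sum(agent_utility[a][b] + arm_utility[b][a] for a, b in matching.items())


def rawlsian_welfare(matching, agent_utility, arm_utility):
    """Utility of the worst-off participant (assumed here to range over both sides)."""
    return min(min(agent_utility[a][b], arm_utility[b][a]) for a, b in matching.items())


# Two matchings of the toy market, written as agent -> arm.
m1 = {"a1": "b1", "a2": "b2"}
m2 = {"a1": "b2", "a2": "b1"}

u1 = utilitarian_welfare(m1, agent_utility, arm_utility)  # 9 + 2 + 8 + 5 = 24
u2 = utilitarian_welfare(m2, agent_utility, arm_utility)  # 4 + 6 + 3 + 7 = 20
r1 = rawlsian_welfare(m1, agent_utility, arm_utility)     # min(9, 2, 8, 5) = 2
r2 = rawlsian_welfare(m2, agent_utility, arm_utility)     # min(4, 6, 3, 7) = 3
```

In this toy instance both matchings are stable, yet the utilitarian objective favors `m1` while the maximin objective favors `m2`, which is why the two criteria can call for different stable matchings.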

Abstract

Two-sided matching markets have demonstrated significant impact in many real-world applications, including school choice, medical residency placement, electric vehicle charging, ride sharing, and recommender systems. However, traditional models often assume that preferences are known, whereas in many modern markets preferences are unknown and must be learned. For example, in online markets a company may not know its preferences over all job applicants a priori. Recent research has modeled matching markets as a multi-armed bandit (MAB) problem and has primarily focused on optimizing the matching for one side of the market, which often results in a pessimal solution for the other side. In this paper, we adopt a welfarist approach for both sides of the market, focusing on two metrics, (1) utilitarian welfare and (2) Rawlsian welfare, while maintaining market stability. For these metrics, we propose algorithms based on epoch Explore-Then-Commit (ETC) and analyze their regret bounds. Finally, we conduct simulated experiments to evaluate both welfare and market stability.
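The one-sidedness mentioned above can be seen with the classical Gale-Shapley deferred acceptance procedure when preferences are fully known. The sketch below is a textbook version, not the paper's learning algorithm; it assumes complete preference lists and equally sized sides, and the names `agent_prefs` and `arm_prefs` are hypothetical.

```python
def deferred_acceptance(proposer_prefs, receiver_prefs):
    """Textbook Gale-Shapley; returns a proposer -> receiver stable matching.

    Assumes complete preference lists and equally sized sides.
    """
    # rank[r][p]: position of proposer p in receiver r's list (lower is better)
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next list index to propose to
    held = {}                                     # receiver -> proposer currently held
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = held.get(r)
        if current is None:
            held[r] = p                  # receiver was unmatched: accept
        elif rank[r][p] < rank[r][current]:
            held[r] = p                  # receiver trades up
            free.append(current)         # displaced proposer becomes free again
        else:
            free.append(p)               # rejected: p tries its next choice later
    return {p: r for r, p in held.items()}


# Hypothetical preferences, best first; the two sides disagree completely.
agent_prefs = {"a1": ["b1", "b2"], "a2": ["b2", "b1"]}
arm_prefs = {"b1": ["a2", "a1"], "b2": ["a1", "a2"]}

agent_optimal = deferred_acceptance(agent_prefs, arm_prefs)
arm_optimal = deferred_acceptance(arm_prefs, agent_prefs)
```

Here `agent_optimal` gives every agent its first choice and every arm its last choice, while `arm_optimal` does the reverse. Both matchings are stable, so which side proposes decides which side is favored.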
