Stochastic Matching Bandits with Rare Optimization Updates

Jung-hun Kim, Min-hwan Oh

TL;DR

This work studies stochastic matching bandits (SMB), in which multiple agents may be assigned to each arm and the arm probabilistically accepts one of them according to latent MNL preferences; the goal is to maximize cumulative revenue. To avoid NP-hard per-round optimization, it introduces a batched elimination-based algorithm (B-SMB) that updates assignments only a doubly-logarithmic number of times while preserving a $\widetilde{\mathcal{O}}(\sqrt{T})$ regret bound. A parameter-free variant removes the need to know the nonlinearity parameter $\kappa$ while keeping the same regret rate and rare-update property. The approach combines SVD-based feature reduction, maximum likelihood estimation (MLE), UCB/LCB-guided exploration, and G-/D-optimal design to handle the exponentially large combinatorial action space. Experiments show substantial computational savings and competitive regret.
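To make the acceptance step concrete, here is a minimal sketch of how one arm could choose among its assigned agents under an MNL model. It is an illustration only, not the paper's implementation: the function names, the plain per-agent utilities, and the outside option with utility 0 (the `1 +` in the denominator, which lets the arm accept nobody) are all assumptions made here.

```python
import math
import random


def mnl_accept_probs(utilities):
    """Acceptance probability of each assigned agent and of 'no match'.

    Assumption: standard MNL form with an outside option of utility 0,
    i.e. p_i = exp(u_i) / (1 + sum_j exp(u_j)).
    """
    weights = [math.exp(u) for u in utilities]
    denom = 1.0 + sum(weights)
    return [w / denom for w in weights], 1.0 / denom


def expected_revenue(utilities, rewards):
    """Expected revenue of one arm for a given set of assigned agents."""
    probs, _ = mnl_accept_probs(utilities)
    return sum(p * r for p, r in zip(probs, rewards))


def sample_match(utilities, rng):
    """Sample the index of the accepted agent, or None if nobody is accepted."""
    probs, _ = mnl_accept_probs(utilities)
    draw = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if draw < cumulative:
            return index
    return None  # the remaining probability mass is the outside option


if __name__ == "__main__":
    rng = random.Random(0)
    # Two hypothetical agents assigned to the same arm.
    print(mnl_accept_probs([0.5, -0.2]))
    print(sample_match([0.5, -0.2], rng))
```

In the learning problem the utilities are unknown and must be estimated from observed accept/reject feedback, which is where the MLE and confidence bounds mentioned above come in.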

Abstract

We introduce a bandit framework for stochastic matching under the multinomial logit (MNL) choice model. In our setting, $N$ agents on one side are assigned to $K$ arms on the other side, where each arm stochastically selects an agent from its assigned pool according to unknown preferences and yields a corresponding reward over a horizon $T$. The objective is to minimize regret by maximizing the cumulative revenue from successful matches. A naive approach requires solving an NP-hard combinatorial optimization problem at every round, resulting in a prohibitive computational cost. To address this challenge, we propose batched algorithms that strategically limit the number of times matching assignments are updated to $\Theta(\log\log T)$ over the entire horizon. By invoking expensive combinatorial optimization only on a vanishing fraction of rounds, our algorithms substantially reduce overall computational overhead while still achieving a regret bound of $\widetilde{\mathcal{O}}(\sqrt{T})$.
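The following sketch illustrates why $\Theta(\log\log T)$ updates cover only a vanishing fraction of rounds. It uses the grid $t_m = \lceil T^{1-2^{-m}} \rceil$ that is common in the batched bandit literature; whether the paper uses exactly this grid is an assumption, and `batch_end_points` is a hypothetical helper name.

```python
import math


def batch_end_points(T):
    """Return the last round of each batch for a horizon of T rounds.

    Assumption: the grid t_m = ceil(T ** (1 - 2 ** -m)) with
    ceil(log2(log2(T))) batches, the final batch ending at T.
    """
    if T < 2:
        raise ValueError("T must be at least 2")
    num_batches = max(1, math.ceil(math.log2(math.log2(T))))
    ends = []
    for m in range(1, num_batches + 1):
        if m == num_batches:
            ends.append(T)  # the final batch always runs to the horizon
        else:
            ends.append(min(T, math.ceil(T ** (1.0 - 2.0 ** (-m)))))
    return ends


if __name__ == "__main__":
    T = 10**6
    ends = batch_end_points(T)
    # Assignments would be recomputed once per batch, not once per round.
    print(len(ends), "updates over", T, "rounds:", ends)
```

Under this grid a horizon of $10^6$ rounds is split into five batches, so the expensive combinatorial step would run five times rather than a million times.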

Paper Structure

This paper contains 44 sections, 28 theorems, 137 equations, 6 figures, 3 algorithms.

Key Result

Proposition 5.1

$\tau_T\le M$.

Figures (6)

  • Figure 1: Illustration of the stochastic matching process with 4 agents ($N = 4$) and 3 arms ($K = 3$).
  • Figure 2: Results for $N=3$, $K=2$: (left) cumulative optimization updates, (middle) runtime, (right) cumulative regret.
  • Figure 3: Results for $N=7$, $K=4$: (left) cumulative optimization updates, (middle) runtime, (right) cumulative regret.
  • Figure 4: Cardinality of the active assignment set $\mathcal{M}_\tau$ over time: (left) $N=3$, $K=2$; (right) $N=7$, $K=4$.
  • Figure 5: Experimental results with $N=8$ and $K=4$: (left) runtime cost and (right) regret of the algorithms. Notably, increasing $N$ from 7 (Figure 3) to 8 causes the runtime of OFU-MNL$^+$ to rise from 5,000 seconds to more than 15,000 seconds, whereas our algorithms maintain runtimes under 1,000 seconds. In terms of regret, our algorithms achieve results comparable to OFU-MNL$^+$.
  • ...and 1 more figure

Theorems & Definitions (50)

  • Remark 4.1
  • Proposition 5.1: Number of Batch Updates
  • Theorem 5.2
  • Corollary 5.3
  • Remark 5.4: Efficiency via Rare Optimization Updates
  • Remark 6.1
  • Remark 6.2
  • Proposition 6.3: Number of Batch Updates
  • Theorem 6.4
  • Corollary 6.5
  • ...and 40 more