Bandits with Single-Peaked Preferences and Limited Resources

Gur Keinan, Rotem Torkan, Omer Ben-Porat

TL;DR

The paper studies online budgeted matching with $U$ users and $K$ arms under a budget constraint, and shows that the general offline optimization problem is NP-hard. Assuming single-peaked (SP) preferences, it develops efficient algorithms: an offline SP-Matching algorithm that runs in $O(K^2U+K^2B)$ time, and online methods with sublinear regret. When the SP order is known (and peaks are known or identifiable), a UCB-based Maximal-Matrix approach (MvM) achieves $\tilde{O}(U\sqrt{TK})$ regret. When the order is unknown, the EMC algorithm combines PQ-tree-based order extraction with an SP-projected offline solver to obtain $\tilde{O}(UKT^{2/3})$ regret. The work also proves lower bounds showing that inherent difficulty persists even under SP, which clarifies how much structure helps in making an intractable problem solvable, and it points to richer SP-like structures as a direction for future work.

Abstract

We study an online stochastic matching problem in which an algorithm sequentially matches $U$ users to $K$ arms, aiming to maximize cumulative reward over $T$ rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on single-peaked preferences, a well-established structure in social choice theory, where users' preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem, and leverage it into an efficient online algorithm with a regret of $\tilde O(UKT^{2/3})$. Our approach relies on a novel PQ tree-based order approximation method. If the single-peaked structure is known, we develop an efficient UCB-like algorithm that achieves a regret bound of $\tilde O(U\sqrt{TK})$.
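The unimodality condition in the abstract can be illustrated with a minimal sketch. It assumes that "unimodal" means weakly increasing up to a peak and weakly decreasing afterwards, and that the expected rewards are given as a list of per-user rows. The function names and the example matrix are hypothetical and not taken from the paper, and the paper's formal definitions may differ in details such as strictness.

```python
from typing import Sequence


def is_unimodal(row: Sequence[float]) -> bool:
    """Return True if row weakly rises to its maximum and weakly falls after it.

    Assumes row is non-empty (at least one arm).
    """
    # Index of the first maximal entry, taken as the user's peak.
    peak = max(range(len(row)), key=row.__getitem__)
    rising = all(row[i] <= row[i + 1] for i in range(peak))
    falling = all(row[i] >= row[i + 1] for i in range(peak, len(row) - 1))
    return rising and falling


def is_single_peaked(theta: Sequence[Sequence[float]], order: Sequence[int]) -> bool:
    """Check that every user's rewards are unimodal once arms are listed in `order`."""
    return all(is_unimodal([row[k] for k in order]) for row in theta)


# Hypothetical 2-user, 3-arm instance (illustrative values only).
theta = [
    [0.2, 0.9, 0.5],  # user 0
    [0.8, 0.1, 0.6],  # user 1
]

print(is_single_peaked(theta, [0, 1, 2]))  # natural arm order
print(is_single_peaked(theta, [1, 2, 0]))  # reordered arms
```

In this toy instance the natural arm order fails for user 1, while the order 1, 2, 0 makes both rows unimodal, so the check passes only after reordering. The sketch verifies a given order; finding one when it is unknown is the harder task that the paper addresses with PQ trees.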

Paper Structure

This paper contains 50 sections, 27 theorems, 60 equations, 2 figures, 5 algorithms.

Key Result

Lemma 0

Fix any PSP matrix $\Theta$ with peaks $p(\cdot)$, and any arm subset $S = \{k_1, \ldots, k_m\} \subseteq K$ with $k_1 < \cdots < k_m$. Let $\pi^\star \in \mathop{\mathrm{arg\,max}}\limits_{\pi,\, \mathrm{Im}(\pi) \subseteq S} V(\pi;\Theta)$. For any user $u$, if $k_j \leq p(u) \leq k_{j+1}$ for som

Figures (2)

  • Figure 1: Illustration of PSP and SP instances. Each subfigure consists of curves representing the expected rewards of user-arm pairs. The instance in the unordered subfigure is SP, since reordering its arms yields the PSP instance in the ordered subfigure.
  • Figure 2: Log-log plots of cumulative regret versus time for both algorithms. Each plot shows the mean regret over all 10 instances (with 10 runs each) and shaded regions indicating standard deviation. The EMC algorithm (left) achieves a slope of approximately 0.69, approaching the theoretical guarantee of $2/3 \approx 0.67$. The MvM algorithm (right) demonstrates slopes below 0.5, consistent with the theoretical bound.
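The slopes quoted for Figure 2 are exponents read off a log-log plot: if regret grows like $c\,T^{\alpha}$, then $\log(\text{regret})$ is linear in $\log T$ with slope $\alpha$. The sketch below estimates that slope with an ordinary least-squares fit. It uses synthetic data rather than the paper's measurements, the function name is hypothetical, and the fitting procedure the authors actually used is not stated here.

```python
import math
from typing import Sequence


def loglog_slope(times: Sequence[float], regrets: Sequence[float]) -> float:
    """Least-squares slope of log(regret) against log(time).

    Assumes all inputs are positive and that at least two distinct times are given.
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(r) for r in regrets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic curve that grows exactly like T^(2/3); illustrative only.
times = [10, 100, 1_000, 10_000]
regrets = [3.0 * t ** (2 / 3) for t in times]

slope = loglog_slope(times, regrets)
print(round(slope, 3))
```

On noiseless power-law data the fit recovers the exponent, so a slope near 0.67 is what a $T^{2/3}$ rate looks like, and a slope below 0.5 is consistent with a $\sqrt{T}$ bound.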

Theorems & Definitions (51)

  • Definition 1: PSP matrix
  • Definition 2: SP matrix and SP order
  • Lemma 0
  • Theorem 1
  • Definition 3
  • Proposition 1
  • Lemma 1
  • Lemma 1
  • Theorem 2
  • Definition 4: Confidence set
  • ...and 41 more