Bandits with Single-Peaked Preferences and Limited Resources

Gur Keinan; Rotem Torkan; Omer Ben-Porat

TL;DR

The paper studies online budgeted matching of $U$ users to $K$ arms under a budget constraint and shows that the general offline optimization problem is NP-hard. By imposing single-peaked (SP) preferences, it obtains efficient SP-specific algorithms: an offline algorithm, SP-Matching, that runs in $O(K^2U+K^2B)$ time, and online methods with sublinear regret. When the SP order is known (and peaks are known or identifiable), a UCB-based Maximal-Matrix approach (MvM) achieves $\tilde{O}(U\sqrt{TK})$ regret. When the order is unknown, the EMC algorithm combines order extraction via PQ-trees with an SP-projected offline solver to obtain $\tilde{O}(UKT^{2/3})$ regret. The work also proves lower bounds showing that inherent difficulty persists even under SP. Together, these results clarify how structure can turn an intractable problem into a solvable one, and they point to richer SP-like structures as a direction for future work.

Abstract

We study an online stochastic matching problem in which an algorithm sequentially matches $U$ users to $K$ arms, aiming to maximize cumulative reward over $T$ rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on \emph{single-peaked preferences} -- a well-established structure in social choice theory, where users' preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem, and leverage it into an efficient online algorithm with a regret of $\tilde O(UKT^{2/3})$. Our approach relies on a novel PQ tree-based order approximation method. If the single-peaked structure is known, we develop an efficient UCB-like algorithm that achieves a regret bound of $\tilde O(U\sqrt{TK})$.
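To make the single-peaked condition concrete, the following is a minimal Python sketch that checks whether one user's expected rewards are unimodal along a given order over arms. It is an illustration only, not the paper's algorithm: the function name `is_single_peaked`, the list-based representation of rewards and order, and the choice to allow ties (weak unimodality) are all assumptions made for this example.

```python
from typing import Sequence


def is_single_peaked(values: Sequence[float], order: Sequence[int]) -> bool:
    """Check whether `values`, read along `order`, rises to a peak and then falls.

    `values[a]` is a (hypothetical) expected reward of arm `a` for one user,
    and `order` lists arm indices along the common axis. Ties are allowed,
    which is an assumption of this sketch.
    """
    seq = [values[arm] for arm in order]
    if not seq:
        return True  # an empty preference list is trivially single-peaked

    # Position of the first maximal value along the order.
    peak = max(range(len(seq)), key=seq.__getitem__)

    # Non-decreasing up to the peak, non-increasing after it.
    rising = all(seq[i] <= seq[i + 1] for i in range(peak))
    falling = all(seq[i] >= seq[i + 1] for i in range(peak, len(seq) - 1))
    return rising and falling


if __name__ == "__main__":
    rewards = [0.9, 0.1, 0.8]
    # Not single-peaked along the order 0, 1, 2 (the values dip in the middle).
    print(is_single_peaked(rewards, [0, 1, 2]))
    # Single-peaked along the order 1, 0, 2 (0.1, then 0.9, then 0.8).
    print(is_single_peaked(rewards, [1, 0, 2]))
```

The example shows why the order matters: the same reward vector can fail the check under one order and pass under another, which is the situation the unknown-order setting has to resolve for all users with a single common order.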
