Bandits with Single-Peaked Preferences and Limited Resources
Gur Keinan, Rotem Torkan, Omer Ben-Porat
TL;DR
The paper studies online budgeted matching of $U$ users to $K$ arms over $T$ rounds under a budget $B$, and shows that the general offline optimization problem is NP-hard. Under single-peaked (SP) preferences, it develops efficient algorithms: an offline solver, SP-Matching, that runs in $O(K^2U+K^2B)$ time, and online methods with sublinear regret. When the SP order is known (and peaks are known or identifiable), a UCB-based Maximal-Matrix approach (MvM) achieves $\tilde{O}(U\sqrt{TK})$ regret; when the order is unknown, the EMC algorithm combines PQ-tree-based order extraction with an SP-projected offline solver to obtain $\tilde{O}(UKT^{2/3})$ regret. The work also proves lower bounds showing that inherent difficulty persists even under SP preferences. Together, the results illustrate how structure can turn an intractable problem into a tractable one, and they point to richer SP-like structures as a future direction.
Abstract
We study an online stochastic matching problem in which an algorithm sequentially matches $U$ users to $K$ arms, aiming to maximize cumulative reward over $T$ rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on \emph{single-peaked preferences} -- a well-established structure in social choice theory, where users' preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem, and leverage it into an efficient online algorithm with a regret of $\tilde O(UKT^{2/3})$. Our approach relies on a novel PQ tree-based order approximation method. If the single-peaked structure is known, we develop an efficient UCB-like algorithm that achieves a regret bound of $\tilde O(U\sqrt{TK})$.
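To make the single-peaked assumption concrete, here is a minimal sketch of a check that every user's expected rewards are unimodal along a common order over arms. The function names, the reward-matrix layout (one row per user, one column per arm), and the use of weak inequalities (ties allowed) are illustrative assumptions and are not taken from the paper, whose formal definition may differ.

```python
from typing import Sequence


def is_single_peaked(rewards: Sequence[float], order: Sequence[int]) -> bool:
    """Check that `rewards`, read along `order`, weakly rises to a peak and then weakly falls.

    `rewards[a]` is one user's expected reward for arm `a` (hypothetical layout);
    `order` is a permutation of arm indices giving the common order over arms.
    """
    values = [rewards[arm] for arm in order]
    if not values:
        return True  # an empty preference list is trivially unimodal
    # Position of the (first) maximum along the order.
    peak = max(range(len(values)), key=values.__getitem__)
    rising = all(values[i] <= values[i + 1] for i in range(peak))
    falling = all(values[i] >= values[i + 1] for i in range(peak, len(values) - 1))
    return rising and falling


def all_single_peaked(matrix: Sequence[Sequence[float]], order: Sequence[int]) -> bool:
    """Check that every user (row) is single-peaked with respect to the same order."""
    return all(is_single_peaked(row, order) for row in matrix)


# Two users, four arms; the values are made up for illustration.
matrix = [
    [0.2, 0.5, 0.9, 0.4],  # user 0 peaks at arm 2
    [0.8, 0.6, 0.3, 0.1],  # user 1 peaks at arm 0
]
consistent = all_single_peaked(matrix, [0, 1, 2, 3])    # order under which both rows are unimodal
inconsistent = all_single_peaked(matrix, [2, 0, 1, 3])  # order that breaks user 0's unimodality
```

This only verifies a given order. Recovering an order from estimated rewards is the harder task that the abstract attributes to the PQ tree-based method, which this sketch does not attempt.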
