Pairwise Elimination with Instance-Dependent Guarantees for Bandits with Cost Subsidy

Ishank Juneja, Carlee Joe-Wong, Osman Yağan

TL;DR

The paper broadens the MAB-CS framework by adding fixed-threshold and known-reference-arm settings, and introduces Pairwise Elimination (PE) and its subsidized variant PE-CS. It establishes instance-dependent lower bounds and proves logarithmic upper bounds on both cost and quality regret, with PE being order-optimal in the known-reference-arm case. PE-CS extends PE to the subsidized best reward setting via a Best-Arm Identification stage, achieving competitive regret guarantees and favorable empirical performance on real datasets. Together, these results advance both theory and practice for cost-aware bandits with reward constraints, offering robust, data-driven strategies for cost-efficient decision making under uncertainty.

Abstract

Multi-armed bandits (MAB) are commonly used in sequential online decision-making when the reward of each decision is an unknown random variable. In practice, however, the typical goal of maximizing total reward may be less important than minimizing the total cost of the decisions taken, subject to a reward constraint. For example, we may seek to make decisions that have at least the reward of a reference "default" decision, with as low a cost as possible. This problem was recently introduced in the Multi-Armed Bandits with Cost Subsidy (MAB-CS) framework. MAB-CS is broadly applicable to problem domains where a primary metric (cost) is constrained by a secondary metric (reward), and the rewards are unknown. In our work, we address variants of MAB-CS, including ones with reward constrained by the reward of a known reference arm or by the subsidized best reward. We introduce the Pairwise-Elimination (PE) algorithm for the known reference arm variant and generalize PE to PE-CS for the subsidized best reward variant. Our instance-dependent analysis of PE and PE-CS reveals that both algorithms have an order-wise logarithmic upper bound on Cost and Quality Regret, making our policies the first with such a guarantee. Moreover, by comparing our upper and lower bound results, we establish that PE is order-optimal for all known reference arm problem instances. Finally, experiments conducted on the MovieLens 25M and Goodreads datasets for both PE and PE-CS reveal the effectiveness of PE and the superior balance between performance and reliability offered by PE-CS compared to baselines from the literature.
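To make the known reference arm objective concrete, the following is a minimal sketch of how the optimal arm and the per-round cost and quality regret might be computed on a toy instance. It is not the PE algorithm, and it is not taken from the paper: the names `Arm`, `optimal_arm`, and `per_round_regret` are hypothetical, the arm means are treated as known purely for illustration, and the threshold $(1 - \alpha)\mu_\ell$ and the clipped regret definitions are assumptions that follow the usual MAB-CS convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arm:
    cost: float         # known cost of pulling the arm
    mean_reward: float  # expected reward (unknown to a learner; known here for illustration)


def reward_threshold(arms, ref, alpha=0.0):
    """Assumed constraint: (1 - alpha) times the mean reward of the reference arm."""
    return (1.0 - alpha) * arms[ref].mean_reward


def optimal_arm(arms, ref, alpha=0.0):
    """Index of the cheapest arm whose mean reward meets the threshold."""
    threshold = reward_threshold(arms, ref, alpha)
    feasible = [i for i, a in enumerate(arms) if a.mean_reward >= threshold]
    return min(feasible, key=lambda i: arms[i].cost)


def per_round_regret(arms, ref, pulled, alpha=0.0):
    """Return (cost regret, quality regret) for pulling arm `pulled` once.

    Both terms are clipped at zero, so a cheaper arm earns no negative cost
    regret and a better arm earns no negative quality regret.
    """
    best = optimal_arm(arms, ref, alpha)
    threshold = reward_threshold(arms, ref, alpha)
    cost_regret = max(arms[pulled].cost - arms[best].cost, 0.0)
    quality_regret = max(threshold - arms[pulled].mean_reward, 0.0)
    return cost_regret, quality_regret


# Toy instance (made-up numbers), arms sorted by increasing cost.
arms = [Arm(0.1, 0.3), Arm(0.2, 0.6), Arm(0.5, 0.7), Arm(0.9, 0.9)]
ref = 2  # index of the reference arm

print(optimal_arm(arms, ref, alpha=0.0))
print(optimal_arm(arms, ref, alpha=0.25))
print(per_round_regret(arms, ref, pulled=0, alpha=0.0))
```

In this toy instance, a subsidy of $\alpha = 0.25$ lowers the threshold enough for a cheaper arm to become optimal, while pulling the cheapest arm with $\alpha = 0$ incurs quality regret but no cost regret. A learning policy would have to estimate the means from samples; the sketch only fixes what is being optimized.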

Paper Structure

This paper contains 21 sections, 31 theorems, 157 equations, 9 figures, 2 tables, 10 algorithms.

Key Result

Theorem 3.1

Under any consistent policy $\pi$, the expected numbers of samples of a low-cost arm and of the reference arm $\ell$ are each lower bounded in terms of the problem horizon $T$, where the rewards of all $K$ arms are Gaussian distributed with variance $\sigma^2 = 1$. "Low cost" is relative to the cost of the optimal arm $a^*$.

Figures (9)

  • Figure 1: Fig. 1(a) varies the reference arm index $\ell$ for MovieLens, and Fig. 1(b) does the same for Goodreads ($\alpha = 0$ in both). Fig. 1(c) fixes $\ell = 11$ and varies $\alpha$ for MovieLens, while Fig. 1(d) fixes $\ell = 4$ and varies $\alpha$ for Goodreads. Each data point is the terminal regret at $T = 5$M from a single experiment, with 25 independent runs for each algorithm. There is no inherent notion of a reference arm in either dataset, so $\ell$ is picked arbitrarily.
  • Figure 2: Fig. 2(a) shows the regret trend for MovieLens and Fig. 2(c) does the same for Goodreads, both for $\alpha = 0.25$. As in Fig. 1(c) and 1(d), Fig. 2(b) and 2(d) show the terminal regret trend ($T = 5$M, 50 runs of each algorithm). UCB-CS is omitted from (a) and (c) because its regret was orders of magnitude worse, as can be seen from Fig. 2(b) and 2(d).
  • Figure 3: Problem instance for MovieLens 25M experiments
  • Figure 4: Problem instance for Goodreads experiment
  • Figure 5: Trade-off between cost and quality regret for the subsidized best reward setting with MovieLens (left) and Goodreads (right). The data used for visualization remain the same as in the experiment in Figure \ref{fig:pecs_combined}, Section \ref{sec:experiments}.
  • ...and 4 more figures

Theorems & Definitions (75)

  • Theorem 3.1: Lower bound for known reference arm setting
  • Theorem 3.2: Instance dependent upper bound on cumulative cost and quality regret for PE
  • Theorem 3.3: Lower bound for subsidized best reward setting
  • Theorem 3.4: Instance dependent upper bound on cumulative cost and quality regret for PE-CS
  • Definition D.1: Subgaussian Random Variable
  • Lemma D.1: Bounded random variables are Subgaussian, Example 5.6(c) in lattimore2020bandit
  • Lemma D.2: Hoeffding Bound, Section 5.4 in lattimore2020bandit
  • Lemma D.3: Iterated expectation over mutually exclusive and exhaustive events
  • proof
  • Lemma D.4: Expectation is at most equal to larger of the conditioned expectations
  • ...and 65 more