p-Mean Regret for Stochastic Bandits

Anand Krishna, Philips George John, Adarsh Barik, Vincent Y. F. Tan

TL;DR

This work extends the $p$-mean welfare objective to stochastic multi-armed bandits by introducing $p$-mean regret. The parameter $p$ interpolates between fairness and efficiency, and the framework recovers both average cumulative regret and Nash regret as special cases. The proposed two-phase Explore-Then-UCB algorithm first performs calibrated uniform exploration and then runs UCB1, with regret guarantees for all $p \in (-\infty,1]$ under a mild positive-reward assumption. The bounds fall into three regimes: $\tilde{O}\left(\sqrt{\frac{k}{T^{1/(2|p|)}}}\right)$ for $p\le-1$, $\tilde{O}\left(\sqrt{\frac{k^{3/2}}{T^{1/2}}}\right)$ for $-1<p<0$, and $\tilde{O}\left(\sqrt{\frac{k}{T}}\right)$ for $0<p\le1$. Nash regret arises in the limit $p\to0$, where the bound matches prior results up to constant factors. Because a single algorithm covers the whole range of $p$, the approach simplifies design choices for fairness-aware bandit learning and motivates extensions to contextual/linear settings and alternative meta-algorithms. Potential applications include resource allocation and social welfare.
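
To make the three regimes concrete, the stated bounds can be evaluated at a few values of $p$. This is a worked substitution only, with logarithmic factors hidden in $\tilde{O}$ as in the text:

$$
\begin{aligned}
p = -2 &: \quad \tilde{O}\left(\sqrt{\frac{k}{T^{1/4}}}\right) = \tilde{O}\left(k^{1/2}\, T^{-1/8}\right), \\
p = -1 &: \quad \tilde{O}\left(\sqrt{\frac{k}{T^{1/2}}}\right) = \tilde{O}\left(k^{1/2}\, T^{-1/4}\right), \\
-1 < p < 0 &: \quad \tilde{O}\left(\sqrt{\frac{k^{3/2}}{T^{1/2}}}\right) = \tilde{O}\left(k^{3/4}\, T^{-1/4}\right), \\
0 < p \le 1 &: \quad \tilde{O}\left(\sqrt{\frac{k}{T}}\right) = \tilde{O}\left(k^{1/2}\, T^{-1/2}\right).
\end{aligned}
$$

The dependence on $T$ weakens as $p$ becomes more negative below $-1$, while for $-1 \le p < 0$ the rate in $T$ stays at $T^{-1/4}$ and only the dependence on $k$ changes.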

Abstract

In this work, we extend the concept of the $p$-mean welfare objective from social choice theory (Moulin 2004) to study $p$-mean regret in stochastic multi-armed bandit problems. The $p$-mean regret, defined as the difference between the optimal mean among the arms and the $p$-mean of the expected rewards, offers a flexible framework for evaluating bandit algorithms, enabling algorithm designers to balance fairness and efficiency by adjusting the parameter $p$. Our framework encompasses both average cumulative regret and Nash regret as special cases. We introduce a simple, unified UCB-based algorithm (Explore-Then-UCB) that achieves novel $p$-mean regret bounds. Our algorithm consists of two phases: a carefully calibrated uniform exploration phase to initialize sample means, followed by the UCB1 algorithm of Auer, Cesa-Bianchi, and Fischer (2002). Under mild assumptions, we prove that our algorithm achieves a $p$-mean regret bound of $\tilde{O}\left(\sqrt{\frac{k}{T^{\frac{1}{2|p|}}}}\right)$ for all $p \leq -1$, where $k$ represents the number of arms and $T$ the time horizon. When $-1<p<0$, we achieve a regret bound of $\tilde{O}\left(\sqrt{\frac{k^{1.5}}{T^{\frac{1}{2}}}}\right)$. For the range $0< p \leq 1$, we achieve a $p$-mean regret scaling as $\tilde{O}\left(\sqrt{\frac{k}{T}}\right)$, which matches the previously established lower bound up to logarithmic factors (Auer et al. 1995). This result stems from the fact that the $p$-mean regret of any algorithm is at least its average cumulative regret for $p \leq 1$. In the case of Nash regret (the limit as $p$ approaches zero), our unified approach differs from prior work (Barman et al. 2023), which requires a new Nash Confidence Bound algorithm. Notably, we achieve the same regret bound up to constant factors using our more general method.
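
The two-phase structure and the regret notion described above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the Bernoulli reward model, the round-robin form of uniform exploration, and the `explore_rounds` parameter are all hypothetical (the abstract says the exploration length is carefully calibrated but does not give its value). The regret helper also plugs in the arms pulled in a single run, whereas the paper's definition is stated in terms of expected rewards.

```python
import math
import random


def p_mean(values, p):
    """Generalized (power) mean of strictly positive values.

    p = 1 is the arithmetic mean, p = -1 the harmonic mean, and
    p = 0 is taken as the geometric-mean limit.
    """
    if not values or any(v <= 0 for v in values):
        raise ValueError("p_mean needs a non-empty list of positive values")
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def explore_then_ucb(means, horizon, explore_rounds, rng):
    """Round-robin exploration for `explore_rounds` steps, then UCB1.

    `means` are Bernoulli success probabilities (an assumed reward model).
    `explore_rounds` is a free parameter here, not the paper's calibration.
    Returns the list of arm indices pulled, one per round.
    """
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    pulls = []

    def ucb1_index(i, t):
        # UCB1 index: sample mean plus sqrt(2 ln t / n_i).
        return sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])

    for t in range(1, horizon + 1):
        if t <= explore_rounds or min(counts) == 0:
            arm = (t - 1) % k  # phase one: cycle through the arms
        else:
            arm = max(range(k), key=lambda i: ucb1_index(i, t))  # phase two
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        pulls.append(arm)
    return pulls


def p_mean_regret(means, pulls, p):
    """Best arm mean minus the p-mean of the means of the arms pulled.

    Single-run plug-in version; the paper's definition uses expectations.
    """
    return max(means) - p_mean([means[a] for a in pulls], p)


rng = random.Random(0)
arm_means = [0.9, 0.6, 0.3]  # strictly positive, as the p-mean requires
run = explore_then_ucb(arm_means, horizon=5000, explore_rounds=150, rng=rng)
for p in (1, 0, -1, -2):
    print(p, p_mean_regret(arm_means, run, p))
```

Because the power mean is non-decreasing in $p$, the printed regret can only grow as $p$ decreases: rounds spent on poor arms are penalized more heavily for smaller $p$, which is the fairness-versus-efficiency trade-off the abstract describes.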

Paper Structure

This paper contains 18 sections, 10 theorems, 68 equations, 1 table, 1 algorithm.

Key Result

Lemma 1

As long as the assumptions of sufficiently positive rewards and sufficiently large exploration are satisfied, $P(G) \geq 1 - \frac{2}{T}$, where $G = G_1 \cap G_2$.
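
This excerpt does not show how the $\frac{2}{T}$ term arises. One standard route to a bound of this shape is a union bound, sketched here under the assumption (not confirmed by this excerpt) that each of the events $G_1$ and $G_2$ fails with probability at most $\frac{1}{T}$:

$$
P(G^c) = P(G_1^c \cup G_2^c) \le P(G_1^c) + P(G_2^c) \le \frac{1}{T} + \frac{1}{T} = \frac{2}{T},
\qquad\text{so}\qquad
P(G) \ge 1 - \frac{2}{T}.
$$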

Theorems & Definitions (19)

  • Remark 1
  • Lemma 1
  • Lemma 1: UCB correctness
  • Lemma 1: Only good arms in phase two
  • Theorem 2
  • proof
  • Remark 2
  • Theorem 3
  • Theorem 4
  • Remark 2
  • ...and 9 more