p-Mean Regret for Stochastic Bandits
Anand Krishna, Philips George John, Adarsh Barik, Vincent Y. F. Tan
TL;DR
This work extends the $p$-mean welfare concept to stochastic multi-armed bandits by introducing $p$-mean regret, whose parameter $p$ interpolates between fairness and efficiency and which contains average regret and Nash regret as special cases. It proposes a simple two-phase Explore-Then-UCB algorithm that first performs calibrated uniform exploration and then runs UCB1, with regret guarantees for every $p \le 1$ under a mild positive-reward assumption. The bounds fall into three regimes: $\tilde{O}\left(\sqrt{\frac{k}{T^{1/(2|p|)}}}\right)$ for $p\le-1$, $\tilde{O}\left(\sqrt{\frac{k^{3/2}}{T^{1/2}}}\right)$ for $-1<p<0$, and $\tilde{O}\left(\sqrt{\frac{k}{T}}\right)$ for $0<p\le1$. Nash regret arises in the limit $p\to0$, where the bound matches prior results up to constant factors. Because a single algorithm covers the whole range of $p$, no objective-specific algorithm design is needed for fairness-aware bandit learning, which motivates extensions to contextual and linear settings and to alternative meta-algorithms. Potential applications include resource allocation and other social welfare problems in sequential decision-making.
Abstract
In this work, we extend the concept of the $p$-mean welfare objective from social choice theory (Moulin 2004) to study $p$-mean regret in stochastic multi-armed bandit problems. The $p$-mean regret, defined as the difference between the optimal mean among the arms and the $p$-mean of the expected rewards, offers a flexible framework for evaluating bandit algorithms, enabling algorithm designers to balance fairness and efficiency by adjusting the parameter $p$. Our framework encompasses both average cumulative regret and Nash regret as special cases. We introduce a simple, unified UCB-based algorithm (Explore-Then-UCB) that achieves novel $p$-mean regret bounds. Our algorithm consists of two phases: a carefully calibrated uniform exploration phase to initialize sample means, followed by the UCB1 algorithm of Auer, Cesa-Bianchi, and Fischer (2002). Under mild assumptions, we prove that our algorithm achieves a $p$-mean regret bound of $\tilde{O}\left(\sqrt{\frac{k}{T^{\frac{1}{2|p|}}}}\right)$ for all $p \leq -1$, where $k$ is the number of arms and $T$ is the time horizon. When $-1<p<0$, we achieve a regret bound of $\tilde{O}\left(\sqrt{\frac{k^{1.5}}{T^{\frac{1}{2}}}}\right)$. For the range $0< p \leq 1$, we achieve a $p$-mean regret scaling as $\tilde{O}\left(\sqrt{\frac{k}{T}}\right)$, which matches the previously established lower bound up to logarithmic factors (Auer et al. 1995). The lower bound carries over because, for $p \leq 1$, the $p$-mean regret of any algorithm is at least its average cumulative regret. In the case of Nash regret (the limit as $p$ approaches zero), our unified approach differs from prior work (Barman et al. 2023), which requires a new Nash Confidence Bound algorithm. Notably, we achieve the same regret bound up to constant factors using our more general method.
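As a rough illustration of the quantities described above, the following Python sketch computes a power mean $\left(\frac{1}{T}\sum_{t} x_t^p\right)^{1/p}$ (with the geometric mean as the $p \to 0$ limit), the corresponding regret, and a two-phase explore-then-UCB1 loop. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the Bernoulli reward model, the choice of uniformly random pulls in the first phase, and the `explore_rounds` parameter are all hypothetical, and the calibrated exploration length analyzed in the paper is not reproduced here. The expected per-round reward is approximated by averaging over a few independent runs.

```python
import math
import random


def p_mean(values, p):
    """Power mean of strictly positive values; p == 0 gives the geometric mean."""
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def p_mean_regret(expected_rewards, best_mean, p):
    """Optimal mean minus the p-mean of the per-round expected rewards."""
    return best_mean - p_mean(expected_rewards, p)


def explore_then_ucb(means, horizon, explore_rounds, seed=0):
    """Run a two-phase bandit loop and return the true mean of each pulled arm.

    `explore_rounds` is a hypothetical tuning parameter (assumption).
    Rewards are simulated as Bernoulli(means[arm]) (assumption).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of pulls per arm
    sums = [0.0] * k   # total observed reward per arm
    pulled = []
    for t in range(horizon):
        if t < explore_rounds:
            # Phase 1: pull an arm uniformly at random.
            arm = rng.randrange(k)
        elif 0 in counts:
            # Make sure every arm has at least one sample before using UCB1.
            arm = counts.index(0)
        else:
            # Phase 2: UCB1 index = sample mean + sqrt(2 ln t / n_i).
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t + 1) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        pulled.append(means[arm])
    return pulled


if __name__ == "__main__":
    means = [0.9, 0.5, 0.2]  # illustrative positive arm means
    runs = [
        explore_then_ucb(means, horizon=500, explore_rounds=50, seed=s)
        for s in range(20)
    ]
    # Monte Carlo estimate of the expected reward in each round.
    expected = [sum(col) / len(col) for col in zip(*runs)]
    for p in (1, 0, -1):
        print(p, round(p_mean_regret(expected, max(means), p), 4))
```

Because power means are non-decreasing in $p$, the regret computed this way is non-decreasing as $p$ decreases: $p = 1$ recovers average regret, $p = 0$ corresponds to Nash regret, and negative $p$ penalizes rounds with low expected reward more heavily.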
