BanditQ: Fair Bandits with Guaranteed Rewards

Abhishek Sinha

TL;DR

BanditQ addresses fair bandits with guaranteed per-arm reward rates in a stochastic setting by coupling queueing dynamics with a reduction to adversarial MAB. It constructs surrogate rewards $r'_i(t)=(Q_i(t-1)+V)\, r_i(t)$ and solves an online linear optimization subproblem, enabling a universal BanditQ reduction that preserves regret guarantees while satisfying rate constraints. In the full-information setting, Regret$_T(\bm{x}^*)=O\left(\max\left(\frac{T}{\sqrt{V}}, \sqrt{NT}\right)\right)$ and $\mathbb{V}(T)=O(\sqrt{VT})$, which with $V=\sqrt{T}$ yields Regret$_T=O(\max(T^{3/4}, \sqrt{NT}))$ and $\mathbb{V}(T)=O(T^{3/4})$; a monotonicity assumption, or considering time-averaged regret instead, further improves the regret bound to $O(\sqrt{NT})$. In the bandit setting, a scale-free MAB subroutine yields Regret$_T=\tilde{O}(\max(T\sqrt{N}/\sqrt{V}, N^{3/4} T^{5/4}/V, N\sqrt{T}))$ and $\mathbb{V}(T)=\tilde{O}(\max(\sqrt{VT}, N^{1/4} T^{3/4}))$, with $V=\sqrt{T}$ giving regret $\tilde{O}(N^{3/4} T^{3/4})$ and rate violation $\tilde{O}(N^{1/4} T^{3/4})$. Empirical results confirm that BanditQ achieves target reward rates for protected arms and outperforms fairness-baseline policies, including in large-scale settings with thousands of arms.
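
The substitution $V=\sqrt{T}$ in the bounds quoted above can be checked term by term. The short derivation below is only a worked-arithmetic sketch; the condition $N \le T$ used to identify the dominant bandit-regret term is an assumption made here, not a statement from the summary.

```latex
\begin{align*}
\text{Full information:}\quad
  & \frac{T}{\sqrt{V}} = \frac{T}{T^{1/4}} = T^{3/4},
  \qquad \sqrt{VT} = \sqrt{T^{3/2}} = T^{3/4}. \\
\text{Bandit feedback:}\quad
  & \frac{T\sqrt{N}}{\sqrt{V}} = N^{1/2}\, T^{3/4},
  \qquad \frac{N^{3/4} T^{5/4}}{V} = N^{3/4}\, T^{3/4}, \\
  & N\sqrt{T} \le N^{3/4} T^{3/4}
    \iff N^{1/4} \le T^{1/4}
    \iff N \le T \quad \text{(assumed)}, \\
  & \sqrt{VT} = T^{3/4} \le N^{1/4}\, T^{3/4}.
\end{align*}
```

Under that assumption the regret maximum is attained by $N^{3/4} T^{3/4}$ and the violation maximum by $N^{1/4} T^{3/4}$, matching the rates stated in the TL;DR.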

Abstract

Classic no-regret multi-armed bandit algorithms, including the Upper Confidence Bound (UCB), Hedge, and EXP3, are inherently unfair by design. Their unfairness stems from their objective of playing the most rewarding arm as frequently as possible while ignoring the rest. In this paper, we consider a fair prediction problem in the stochastic setting with a guaranteed minimum rate of accrual of rewards for each arm. We study the problem in both full-information and bandit feedback settings. Combining queueing-theoretic techniques with adversarial bandits, we propose a new online policy, called BanditQ, that achieves the target reward rates while conceding a regret and target rate violation penalty of at most $O(T^{\frac{3}{4}}).$ The regret bound in the full-information setting can be further improved to $O(\sqrt{T})$ under either a monotonicity assumption or when considering time-averaged regret. The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard adversarial MAB problem. The analysis of the BanditQ policy involves a new self-bounding inequality, which might be of independent interest.
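
To make the reduction described above more concrete, here is a minimal full-information sketch in Python. Several pieces are assumptions introduced for illustration and are not confirmed by this summary: the queue recursion $Q_i(t)=\max(0,\, Q_i(t-1)+\lambda_i - r_i(t)x_i(t))$, Bernoulli rewards, projected online gradient ascent with an adaptive (AdaGrad-norm style) step size as the online linear optimization subroutine, and the names `banditq_full_info`, `lam`, and `project_to_simplex`. Only the surrogate reward $(Q_i(t-1)+V)\,r_i(t)$ is taken from the text.

```python
import math
import random


def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0:  # holds for k = 1..rho, so the last hit is rho
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def banditq_full_info(means, lam, T, V, seed=0):
    """Illustrative BanditQ-style loop; returns (average reward rates, queues).

    means: Bernoulli reward means per arm (assumed reward model).
    lam:   target reward rate per arm (hypothetical name).
    """
    rng = random.Random(seed)
    n = len(means)
    x = [1.0 / n] * n            # sampling distribution over arms
    Q = [0.0] * n                # one queue per arm, tracking rate deficit
    accrued = [0.0] * n
    grad_sq_sum = 0.0
    diameter = math.sqrt(2.0)    # Euclidean diameter of the simplex

    for _ in range(T):
        r = [1.0 if rng.random() < m else 0.0 for m in means]
        # Surrogate rewards use the queue lengths from the previous round.
        g = [(Q[i] + V) * r[i] for i in range(n)]
        for i in range(n):
            accrued[i] += r[i] * x[i]
            # Assumed queue recursion: arrivals lam[i], service r[i] * x[i].
            Q[i] = max(0.0, Q[i] + lam[i] - r[i] * x[i])
        # Assumed subroutine: projected gradient ascent, adaptive step size.
        grad_sq_sum += sum(gi * gi for gi in g)
        if grad_sq_sum > 0:
            eta = diameter / math.sqrt(2.0 * grad_sq_sum)
            x = project_to_simplex([x[i] + eta * g[i] for i in range(n)])

    return [a / T for a in accrued], Q


if __name__ == "__main__":
    T = 2000
    rates, queues = banditq_full_info([0.9, 0.3], [0.0, 0.1], T, V=math.sqrt(T))
    print("average reward rates:", rates)
    print("final queue lengths:", queues)
```

The design intuition is that an arm falling behind its target rate accumulates a long queue, which inflates its surrogate reward and pushes the subroutine to play it more often, while `V` controls how much weight the raw reward keeps relative to the queue.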

Paper Structure (38 sections, 13 theorems, 80 equations, 13 figures, 3 algorithms)


Introduction
Related Works
Our contributions
Problem formulation
Fairness constraints:
Offline Benchmark and Performance Metric:
BanditQ Policy with full information feedback
The BanditQ policy:
Analysis
1 (a). Bounding the queue lengths:
1 (b). Bounding the rate violation penalty $\mathbb{V}(T)$:
2. Bounding the regret:
Remarks:
Sharper regret bound under a monotonicity assumption:
BanditQ policy with bandit feedback
...and 23 more sections

Key Result

Theorem 1

Let $X \subseteq \mathbb{R}^{d}$ be a convex set with a finite Euclidean diameter $D.$ Consider an arbitrary sequence of linear reward functions with gradients $\{\bm{g}_t\}_{t \geq 1}.$ Assume that the Online Gradient Ascent policy is run with step sizes [...] Without any loss of generality, we set $\eta_{\ldots}$ [...]

Figures (13)

Figure 1: Reward accrual rates in the full-information setting
Figure 2: Queue lengths in the full-information setting
Figure 3: Regret of BanditQ in the full-information setting
Figure 4: Reward accrual rates in the bandit feedback setting
Figure 5: Queue lengths in the bandit feedback setting
...and 8 more figures

Theorems & Definitions (19)

Theorem 1: orabona2019modern, Theorem 4.14
Theorem 2
proof
Proposition 1
Proposition 2
proof
Proposition 3
proof
Theorem 3
Theorem 4: putta2022scale
...and 9 more
