
BanditQ: Fair Bandits with Guaranteed Rewards

Abhishek Sinha

TL;DR

BanditQ addresses fair bandits with guaranteed per-arm reward rates in a stochastic setting by coupling queueing dynamics to an adversarial MAB reduction. It constructs surrogate rewards $r'_i(t)=(Q_i(t-1)+V)\, r_i(t)$ and solves an online linear optimization subproblem, which gives a universal reduction that preserves regret guarantees while satisfying the rate constraints. In the full-information setting, $\mathrm{Regret}_T(\bm{x}^*)=O\left(\max\left(\frac{T}{\sqrt{V}}, \sqrt{NT}\right)\right)$ and $\mathbb{V}(T)=O(\sqrt{VT})$, which with $V=\sqrt{T}$ yields $\mathrm{Regret}_T=O(\max(T^{3/4}, \sqrt{NT}))$ and $\mathbb{V}(T)=O(T^{3/4})$; under a monotonicity assumption, or when time-averaged regret is considered, the regret improves to $O(\sqrt{NT})$. In the bandit setting, a scale-free MAB subroutine yields $\mathrm{Regret}_T=\tilde{O}\left(\max\left(\frac{T\sqrt{N}}{\sqrt{V}}, \frac{N^{3/4} T^{5/4}}{V}, N\sqrt{T}\right)\right)$ and $\mathbb{V}(T)=\tilde{O}(\max(\sqrt{VT}, N^{1/4} T^{3/4}))$, with $V=\sqrt{T}$ giving regret $\tilde{O}(N^{3/4} T^{3/4})$ and rate violation $\tilde{O}(N^{1/4} T^{3/4})$. Empirical results confirm that BanditQ achieves the target reward rates for protected arms and outperforms fairness-baseline policies, including in large-scale settings with thousands of arms.
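
For readers who want to check the $V=\sqrt{T}$ specialization, the substitution can be worked term by term. This is only arithmetic on the bounds quoted above; the remark that the third bandit term is dominated when $N \le T$ is an added observation, not a statement from the paper.

$$
\begin{aligned}
\text{Full information:}\quad & \frac{T}{\sqrt{V}} = \frac{T}{T^{1/4}} = T^{3/4}, \qquad \sqrt{VT} = \sqrt{T^{3/2}} = T^{3/4}. \\
\text{Bandit feedback:}\quad & \frac{T\sqrt{N}}{\sqrt{V}} = N^{1/2} T^{3/4}, \qquad \frac{N^{3/4} T^{5/4}}{V} = N^{3/4} T^{3/4}, \qquad N\sqrt{T} = N\, T^{1/2}. \\
& \max\left(N^{1/2} T^{3/4},\; N^{3/4} T^{3/4},\; N\, T^{1/2}\right) = N^{3/4} T^{3/4} \quad \text{whenever } 1 \le N \le T. \\
& \max\left(\sqrt{VT},\; N^{1/4} T^{3/4}\right) = \max\left(T^{3/4},\; N^{1/4} T^{3/4}\right) = N^{1/4} T^{3/4}.
\end{aligned}
$$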

Abstract

Classic no-regret multi-armed bandit algorithms, including the Upper Confidence Bound (UCB), Hedge, and EXP3, are inherently unfair by design. Their unfairness stems from their objective of playing the most rewarding arm as frequently as possible while ignoring the rest. In this paper, we consider a fair prediction problem in the stochastic setting with a guaranteed minimum rate of accrual of rewards for each arm. We study the problem in both full-information and bandit feedback settings. Combining queueing-theoretic techniques with adversarial bandits, we propose a new online policy, called BanditQ, that achieves the target reward rates while conceding a regret and target rate violation penalty of at most $O(T^{\frac{3}{4}}).$ The regret bound in the full-information setting can be further improved to $O(\sqrt{T})$ under either a monotonicity assumption or when considering time-averaged regret. The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard adversarial MAB problem. The analysis of the BanditQ policy involves a new self-bounding inequality, which might be of independent interest.
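
To make the reduction more concrete, here is a minimal Python sketch of a full-information loop built around the surrogate rewards $r'_i(t)=(Q_i(t-1)+V)\, r_i(t)$ from the TL;DR. Several pieces are assumptions for illustration and are not taken from the text above: the queue update $Q_i(t)=\max(0,\, Q_i(t-1)+\lambda_i - r_i(t)\,x_i(t))$, the symbol $\lambda_i$ for the target rate of arm $i$, the use of projected online gradient ascent on the probability simplex as the linear-optimization subroutine, the adaptive step size, and the function names `project_to_simplex` and `banditq_full_info`.

```python
import math


def project_to_simplex(v):
    """Euclidean projection of the list v onto the probability simplex."""
    u = sorted(v, reverse=True)
    running_sum = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        running_sum += u_k
        candidate = (running_sum - 1.0) / k
        # The condition holds on a prefix of the sorted values;
        # the last k that satisfies it determines the threshold.
        if u_k - candidate > 0:
            theta = candidate
    return [max(v_i - theta, 0.0) for v_i in v]


def banditq_full_info(rewards, target_rates, V):
    """Illustrative full-information loop (a sketch, not the paper's algorithm).

    rewards:      list of T reward vectors, each with N entries in [0, 1]
    target_rates: list of N target rates (lambda_i, hypothetical notation)
    V:            trade-off parameter, e.g. sqrt(T)
    Returns the time-averaged expected reward per arm and the final queues.
    """
    n = len(target_rates)
    Q = [0.0] * n                  # one queue per arm, Q_i(0) = 0
    x = [1.0 / n] * n              # sampling distribution over arms
    accrued = [0.0] * n
    grad_sq_sum = 0.0
    for r in rewards:
        # Surrogate reward uses the queue length from the previous round.
        surrogate = [(Q[i] + V) * r[i] for i in range(n)]
        for i in range(n):
            accrued[i] += r[i] * x[i]
            # ASSUMED queue dynamics: grow by the target rate,
            # drain by the expected reward accrued this round.
            Q[i] = max(0.0, Q[i] + target_rates[i] - r[i] * x[i])
        # ASSUMED adaptive step size for projected gradient ascent.
        grad_sq_sum += sum(g * g for g in surrogate)
        if grad_sq_sum > 0:
            eta = 1.0 / math.sqrt(grad_sq_sum)
            x = project_to_simplex([x[i] + eta * surrogate[i] for i in range(n)])
    T = len(rewards)
    return [a / T for a in accrued], Q
```

A call such as `banditq_full_info(rewards, [0.1, 0.1, 0.0], V=math.sqrt(len(rewards)))` returns the per-arm accrual rates and final queue lengths. Under the assumed dynamics, a queue that keeps growing signals an arm whose target rate is not being met, which raises that arm's surrogate reward in later rounds.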

Paper Structure (38 sections, 13 theorems, 80 equations, 13 figures, 3 algorithms)


Key Result

Theorem 1

Let $X \subseteq \mathbb{R}^{d}$ be a convex set with a finite Euclidean diameter $D.$ Consider an arbitrary sequence of linear reward functions with gradients $\{\bm{g}_t\}_{t \geq 1}.$ Assume that the Online Gradient Ascent policy is run with step sizes ... Without any loss of generality, we set $\eta_{\ldots}$ ...

Figures (13)

  • Figure 1: Reward accrual rates in the full-information setting
  • Figure 2: Queue lengths in the full-information setting
  • Figure 3: Regret of BanditQ in the full-information setting
  • Figure 4: Reward accrual rates in the bandit feedback setting
  • Figure 5: Queue lengths in the bandit feedback setting
  • ...and 8 more figures

Theorems & Definitions (19)

  • Theorem 1: orabona2019modern, Theorem 4.14
  • Theorem 2
  • proof
  • Proposition 1
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Theorem 3
  • Theorem 4: putta2022scale
  • ...and 9 more