Fairness of Exposure in Online Restless Multi-armed Bandits

Archit Sood; Shweta Jain; Sujit Gujar

TL;DR

The paper tackles fairness in online Restless MABs by introducing Merit Fairness, which ties arm exposure to a merit measure derived from steady-state rewards: $\mu_i = f(P_i,1) - f(P_i,0)$. It proposes MF-RMAB, an online algorithm that learns transition dynamics with a UCB-style confidence bound and allocates pulls according to a merit-based distribution $\pi^t_i \propto g(\mu_i^t)$, ensuring exposure proportional to merit. The key theoretical result shows a high-probability sublinear fairness regret for the single-pull setting: $FR^T = \mathcal{O}\left( \frac{L \sqrt{G T \ln(8N \frac{T}{\delta})}}{\gamma (1-\eta)(1-\omega)} \right)$, with extension to multi-pull scenarios demonstrated empirically. Experiments on synthetic and CPAP-inspired domains confirm that MF-RMAB achieves fair exposure across arms while maintaining competitive performance, illustrating practical impact for equitable online interventions in settings like healthcare. Overall, the work blends steady-state merit, online learning, and fairness constraints to enable fair, scalable RMAB policies with provable guarantees.
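The merit and allocation rule summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes $f(P_i, a)$ is the steady-state expected reward of arm $i$ when action $a$ is always applied, that the transition matrices are known and ergodic (the online algorithm would instead use estimates from its confidence set), and that $g(\mu) = 2 + \mu$ is the merit function. The reward vector, the example arms, and this choice of $g$ are all hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic row-stochastic matrix P."""
    n = P.shape[0]
    # Solve pi @ (P - I) = 0 together with sum(pi) = 1 by least squares.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def steady_state_reward(P_arm, action, rewards):
    """Assumed form of f(P, a): expected reward under the stationary
    distribution when `action` is always applied. P_arm[a] is S x S."""
    return float(stationary_distribution(P_arm[action]) @ rewards)

def merit(P_arm, rewards):
    """mu = f(P, 1) - f(P, 0): steady-state gain from pulling the arm."""
    return steady_state_reward(P_arm, 1, rewards) - steady_state_reward(P_arm, 0, rewards)

def merit_policy(merits, g):
    """pi_i proportional to g(mu_i), normalized to sum to 1."""
    weights = np.array([g(m) for m in merits], dtype=float)
    return weights / weights.sum()

# Hypothetical two-state arms; index 0 = passive action, index 1 = active action.
rewards = np.array([0.0, 1.0])  # assumed: reward 1 in the "good" state
arm_a = np.array([[[0.9, 0.1], [0.5, 0.5]],
                  [[0.5, 0.5], [0.1, 0.9]]])
arm_b = np.array([[[0.8, 0.2], [0.4, 0.6]],
                  [[0.6, 0.4], [0.2, 0.8]]])

g = lambda mu: 2.0 + mu  # hypothetical merit function, positive on [-1, 1]
merits = [merit(arm_a, rewards), merit(arm_b, rewards)]
policy = merit_policy(merits, g)

# Single-pull case: sample one arm per episode from the merit-based policy.
rng = np.random.default_rng(0)
pulled_arm = rng.choice(len(policy), p=policy)
print(merits, policy, pulled_arm)
```

Keeping $g$ strictly positive means every arm retains a nonzero pull probability, which matches the summary's requirement that exposure be proportional to merit rather than concentrated on the best arm.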

Abstract

Restless multi-armed bandits (RMABs) generalize multi-armed bandits: each arm exhibits Markovian behavior and evolves according to its own transition dynamics. Solutions to RMABs exist for both the offline and online cases. However, they do not consider the distribution of pulls among the arms. Studies have shown that optimal policies lead to unfairness, where some arms are not exposed enough. Existing work on fairness in RMABs focuses heavily on the offline case, which limits its applicability in real-world scenarios where the environment is largely unknown. We propose the first fair RMAB framework for the online scenario, in which each arm receives pulls in proportion to its merit. We define the merit of an arm as a function of its stationary reward distribution. We prove that our algorithm achieves sublinear fairness regret of $O(\sqrt{T\ln T})$ in the single-pull case, where $T$ is the total number of episodes. Empirically, we show that our algorithm also performs well in the multi-pull scenario.
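To see how the $O(\sqrt{T\ln T})$ rate follows from the high-probability bound quoted in the TL;DR, the step below treats $L$, $G$, $N$, $\delta$, $\gamma$, $\eta$, and $\omega$ as constants that do not grow with $T$. That is an assumption made here for illustration; this summary only states $G_i < T$ for the per-arm quantities.

$$
\frac{L \sqrt{G\,T \ln\left(8N\frac{T}{\delta}\right)}}{\gamma(1-\eta)(1-\omega)}
= C \sqrt{T\left(\ln T + \ln\frac{8N}{\delta}\right)}
= \mathcal{O}\left(\sqrt{T \ln T}\right),
\qquad
C = \frac{L\sqrt{G}}{\gamma(1-\eta)(1-\omega)}.
$$

Under the same assumption, $\sqrt{T \ln T}/T \to 0$ as $T \to \infty$, so the fairness regret averaged over episodes vanishes.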

Paper Structure (21 sections, 5 theorems, 17 equations, 11 figures, 1 table, 1 algorithm)

Introduction
Related Work
Restless Bandits
Fairness in MAB
Fairness in RMAB
Preliminaries
Merit Fair: Merit-based fairness in RMAB
Methodology
Defining the reward
Online RMAB
MF-RMAB
Theoretical Results
Experimental Section
Domains
Experimental Setup
...and 6 more sections

Key Result

Lemma 5.1

For arm $i$, take any $P_i^t \in B_i^t$, define $\mu_i^t = f(P_i^t,1) - f(P_i^t,0)$. Define a policy using Equation (eqn:our_policy). Then, $\exists G_i < T$ such that $N_i^{t+G_i}(s,a) - N_i^t(s,a) > 0 \ \forall s,a$. In other words, after every $G_i$ episodes, arm $i$ has all its state-action pai

Figures (11)

Figure 1: Exposure of arms under Optimal and MF-RMAB on Synthetic dataset after 10k episodes for $N=5$, $K=1$. The arms are arranged in increasing order of their rewards.
Figure 2: The first three plots show Regret vs. Time for different $K$ and $N$ settings on Synthetic dataset. The last plot shows the Regret with different $K/N$ values for $T \times H = 2\times 10^6$ timesteps.
Figure 3: The first three plots show Regret vs. Time for different $K$ and $N$ settings on Synthetic-alternate dataset. The last plot shows the Regret with different $K/N$ values for $T \times H = 2\times 10^6$ timesteps.
Figure 4: The first three plots show Regret vs. Time for different $K$ and $N$ settings on CPAP dataset. The last plot shows the Regret with different $K/N$ values for $T \times H = 2\times 10^6$ timesteps.
Figure 5: Comparison of MF-RMAB and FaWT-Q with respect to exposure and rewards. The y-axes are normalized to 1.
...and 6 more figures

Theorems & Definitions (6)

Lemma 5.1
Definition 5.2
Proposition 5.3
Theorem 5.4
Theorem 5.5
Corollary 5.6
