Multi-Armed Bandits with Interference
Su Jia, Peter Frazier, Nathan Kallus
TL;DR
This work extends online experimentation to multi-armed bandits with interference (MABI): settings with spatially decaying interference and adversarial rewards, formalized by a model in which each unit's reward depends on the treatments of all units. It shows that switchback policies achieve $\tilde{O}(\sqrt{T})$ expected regret but with high variance, motivating a cluster-based approach that combines implicit exploration with robust partitioning to obtain a high-probability regret bound that vanishes in $N$. The proposed HT-IX estimator within EXP3-IX, together with a robust $(\ell,r)$-random partition, yields near-optimal $\tilde{O}(\sqrt{kT})$ expected regret and tail guarantees that improve as the number of units grows, with corollaries covering no interference, $\kappa$-neighborhood interference, and power-law interference. Experiments corroborate the theory, showing substantial tail-risk reductions for cluster-based designs in large-scale settings typical of online platforms.
Abstract
Experimentation with interference poses a significant challenge in contemporary online platforms. Prior research on experimentation with interference has concentrated on the final output of a policy. The cumulative performance, while equally crucial, is less well understood. To address this gap, we introduce the problem of {\em Multi-armed Bandits with Interference} (MABI), where the learner assigns an arm to each of $N$ experimental units over a time horizon of $T$ rounds. The reward of each unit in each round depends on the treatments of {\em all} units, where the influence of a unit decays with the spatial distance between units. Furthermore, we employ a general setup wherein the reward functions are chosen by an adversary and may vary arbitrarily across rounds and units. We first show that switchback policies achieve an optimal {\em expected} regret of $\tilde O(\sqrt T)$ against the best fixed-arm policy. Nonetheless, the regret (as a random variable) of any switchback policy suffers from high variance, as it does not account for $N$. We propose a cluster randomization policy whose regret (i) is optimal in {\em expectation} and (ii) admits a high-probability bound that vanishes in $N$.
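As a rough illustration of the switchback baseline described above, the following Python sketch runs the standard EXP3-IX algorithm while assigning the same arm to all units in each round, so the $N$ units collapse into a single bandit. It is not the paper's HT-IX estimator or its cluster randomization policy. The callback `reward_fn`, the choice of `eta` and `gamma` (a common tuning for EXP3-IX), and the toy rewards in the usage lines are assumptions made for this example.

```python
import math
import random


def exp3_ix_switchback(reward_fn, k, T, seed=0):
    """Standard EXP3-IX with one shared arm per round (a switchback design).

    reward_fn(t, arm) is a hypothetical callback returning the reward
    averaged over all units in round t, assumed to lie in [0, 1].
    """
    rng = random.Random(seed)
    eta = math.sqrt(2.0 * math.log(k) / (k * T))  # learning rate (assumed tuning)
    gamma = eta / 2.0  # implicit-exploration parameter (assumed tuning)
    cum_loss_est = [0.0] * k  # cumulative estimated loss per arm
    total_reward = 0.0
    probs = [1.0 / k] * k

    for t in range(T):
        # Exponential weights; subtract the minimum for numerical stability.
        base = min(cum_loss_est)
        weights = [math.exp(-eta * (est - base)) for est in cum_loss_est]
        norm = sum(weights)
        probs = [w / norm for w in weights]

        # Every unit receives the same sampled arm in this round.
        arm = rng.choices(range(k), weights=probs)[0]
        reward = reward_fn(t, arm)
        total_reward += reward

        # Implicit exploration: adding gamma to the denominator biases the
        # loss estimate downward but controls its variance.
        loss = 1.0 - reward
        cum_loss_est[arm] += loss / (probs[arm] + gamma)

    return total_reward, probs


# Toy usage: arm 1 is better than arm 0 in every round.
total, final_probs = exp3_ix_switchback(
    lambda t, arm: 0.9 if arm == 1 else 0.1, k=2, T=2000
)
print(total, final_probs)
```

Because a single random draw per round decides the treatment of every unit, the realized regret of such a policy does not concentrate as $N$ grows, which is the variance issue the abstract attributes to switchback designs.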
