One Good Source is All You Need: Near-Optimal Regret for Bandits under Heterogeneous Noise
Aadirupa Saha, Amith Bhat, Haipeng Luo
TL;DR
This work addresses online learning in a multi-armed bandit setting with multiple heterogeneous data sources, where each source has unknown noise variance. The authors introduce SOAR, a two-stage algorithm that first prunes high-variance sources via variance concentration bounds and then performs an adaptive min-max LCB-UCB exploration to jointly identify the best arm and the lowest-variance data source. They prove a near-oracle, instance-dependent regret bound of $\tilde{O}\left({\sigma^*}^2\sum_{i=2}^K \frac{\log T}{\Delta_i} + \sqrt{K \sum_{j=1}^M \sigma_j^2}\right)$, where ${\sigma^*}^2$ is the minimum source variance and $\Delta_i$ is the suboptimality gap of arm $i$. The first term matches the single-source oracle rate up to logarithmic factors, and the second term, $\tilde{O}(\sqrt{K \sum_j \sigma_j^2})$, is an additive cost for source identification. The results improve upon natural baselines that scale with $\sigma_{\max}^2$ or incur costly variance-based distinctions when variances are similar. Empirical results on synthetic data and the MovieLens 25M dataset demonstrate SOAR’s superior performance and its ability to quickly focus on low-variance sources while maintaining strong reward identification.
Abstract
We study the $K$-armed multi-armed bandit (MAB) problem with $M$ heterogeneous data sources, each exhibiting an unknown and distinct noise variance $\{\sigma_j^2\}_{j=1}^M$. The learner's objective is standard MAB regret minimization, with the additional complexity of adaptively selecting which data source to query at each round. We propose Source-Optimistic Adaptive Regret minimization (SOAR), a novel algorithm that quickly prunes high-variance sources using sharp variance-concentration bounds, followed by a "balanced min-max LCB-UCB approach" that seamlessly integrates the parallel tasks of identifying the best arm and the optimal (minimum-variance) data source. Our analysis shows that SOAR achieves an instance-dependent regret bound of $\tilde{O}\left({\sigma^*}^2\sum_{i=2}^K \frac{\log T}{\Delta_i} + \sqrt{K \sum_{j=1}^M \sigma_j^2}\right)$, up to preprocessing costs depending only on problem parameters, where ${\sigma^*}^2 := \min_j \sigma_j^2$ is the minimum source variance and $\Delta_i$ denotes the suboptimality gap of the $i$-th arm. This result is surprising: despite lacking prior knowledge of the minimum-variance source among $M$ alternatives, SOAR attains the optimal instance-dependent regret of standard single-source MAB with variance ${\sigma^*}^2$, while incurring only a small (and unavoidable) additive cost of $\tilde{O}(\sqrt{K \sum_{j=1}^M \sigma_j^2})$ for identifying the optimal (minimum-variance) source. Our theoretical bounds represent a significant improvement over baselines such as Uniform UCB or Explore-then-Commit UCB, which can suffer regret scaling with $\sigma_{\max}^2$ in place of ${\sigma^*}^2$, a gap that can be arbitrarily large when $\sigma_{\max} \gg \sigma^*$. Experiments on multiple synthetic problem instances and the real-world MovieLens 25M dataset demonstrate the superior performance of SOAR over the baselines.
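To make the setting concrete, the sketch below simulates a $K$-armed bandit whose rewards can be queried through one of $M$ Gaussian sources with different standard deviations. It is not SOAR. It is a plain UCB index run under two hypothetical source policies: an oracle that always queries the minimum-variance source, and a uniform policy that picks a source at random. The function name `ucb_pseudo_regret`, the `choose_source` callback, the `width_sigma` parameter, and all numeric values are assumptions made for illustration. In particular, each policy is handed the noise scale its confidence width needs, which the learner in the actual problem does not know.

```python
import math
import random


def ucb_pseudo_regret(means, sigmas, horizon, choose_source, width_sigma, seed=0):
    """Run a UCB index over arms while `choose_source` picks the data source.

    Returns (pseudo_regret, pull_counts). `width_sigma` scales the confidence
    width and is assumed to be given (illustrative simplification).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    totals = [0.0] * k
    best_mean = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialise the estimates
        else:
            # Empirical mean plus a confidence width scaled by width_sigma.
            arm = max(
                range(k),
                key=lambda i: totals[i] / counts[i]
                + width_sigma * math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        source = choose_source(rng, t)
        # Reward noise depends on the queried source, not on the arm.
        reward = rng.gauss(means[arm], sigmas[source])
        counts[arm] += 1
        totals[arm] += reward
        regret += best_mean - means[arm]  # pseudo-regret uses the true gaps
    return regret, counts


# Hypothetical instance: 3 arms, 3 sources with very different noise levels.
means = [0.9, 0.8, 0.5]
sigmas = [0.1, 1.0, 3.0]
horizon = 2000
best_source = min(range(len(sigmas)), key=lambda j: sigmas[j])

# Oracle policy: always query the minimum-variance source.
oracle_regret, oracle_counts = ucb_pseudo_regret(
    means, sigmas, horizon, lambda rng, t: best_source, min(sigmas)
)
# Uniform policy: query a random source, so the width must cover sigma_max.
uniform_regret, uniform_counts = ucb_pseudo_regret(
    means, sigmas, horizon, lambda rng, t: rng.randrange(len(sigmas)), max(sigmas)
)
print(f"oracle-source UCB pseudo-regret:  {oracle_regret:.1f}")
print(f"uniform-source UCB pseudo-regret: {uniform_regret:.1f}")
```

On an instance like this, one would expect the uniform policy to accumulate noticeably more pseudo-regret, because its confidence width has to cover $\sigma_{\max}$ rather than $\sigma^*$. This is the gap the abstract attributes to the baselines.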
