Stability and Robustness via Regularization: Bandit Inference via Regularized Stochastic Mirror Descent

Budhaditya Halder, Ishan Sengupta, Koustav Chowdhury, Koulik Khamaru

TL;DR

This paper develops a systematic theory of stability for bandit algorithms based on stochastic mirror descent, a broad algorithmic framework that includes the widely used EXP3 algorithm as a special case. It shows that inference-enabling stability and learning efficiency are compatible objectives within the mirror descent framework.

Abstract

Statistical inference with bandit data presents fundamental challenges due to adaptive sampling, which violates the independence assumptions underlying classical asymptotic theory. Recent work has identified stability as a sufficient condition for valid inference under adaptivity. This paper develops a systematic theory of stability for bandit algorithms based on stochastic mirror descent, a broad algorithmic framework that includes the widely-used EXP3 algorithm as a special case. Our contributions are threefold. First, we establish a general stability criterion: if the average iterates of a stochastic mirror descent algorithm converge in ratio to a non-random probability vector, then the induced bandit algorithm is stable. This result provides a unified lens for analyzing stability across diverse algorithmic instantiations. Second, we introduce a family of regularized-EXP3 algorithms employing a log-barrier regularizer with appropriately tuned parameters. We prove that these algorithms satisfy our stability criterion and, as an immediate corollary, that Wald-type confidence intervals for linear functionals of the mean parameter achieve nominal coverage. Notably, we show that the same algorithms attain minimax-optimal regret guarantees up to logarithmic factors, demonstrating that inference-enabling stability and learning efficiency are compatible objectives within the mirror descent framework. Third, we establish robustness to corruption: a modified variant of regularized-EXP3 maintains asymptotic normality of empirical arm means even in the presence of $o(T^{1/2})$ adversarial corruptions. This stands in sharp contrast to other stable algorithms such as UCB, which suffer linear regret even under logarithmic levels of corruption.
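
For orientation, the following is a minimal sketch of the standard (unregularized) EXP3 algorithm that the abstract names as a special case of stochastic mirror descent. It is not the paper's regularized-EXP3: the log-barrier regularizer and its tuning are not specified in this summary and are deliberately left out. The function names `exp3` and `make_bernoulli_losses`, the loss means, and the step size are all illustrative assumptions.

```python
import math
import random


def make_bernoulli_losses(means):
    """Hypothetical environment: arm a returns loss 1 with probability means[a]."""
    def loss_fn(arm, rng):
        return 1.0 if rng.random() < means[arm] else 0.0
    return loss_fn


def exp3(loss_fn, num_arms, horizon, eta, seed=0):
    """Standard EXP3 with importance-weighted loss estimates.

    Returns the number of pulls of each arm. This is the textbook
    algorithm, not the regularized variant described in the abstract.
    """
    rng = random.Random(seed)
    cum_loss_est = [0.0] * num_arms  # cumulative estimated losses per arm
    pulls = [0] * num_arms
    for _ in range(horizon):
        # Exponential weights; subtracting the minimum avoids overflow
        # and does not change the normalized probabilities.
        base = min(cum_loss_est)
        weights = [math.exp(-eta * (est - base)) for est in cum_loss_est]
        total = sum(weights)
        probs = [w / total for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        loss = loss_fn(arm, rng)
        # Unbiased estimate: only the pulled arm is updated, scaled by 1/p.
        cum_loss_est[arm] += loss / probs[arm]
        pulls[arm] += 1
    return pulls


# Illustrative run with assumed loss means; arm 0 has the smallest loss.
K, T = 3, 5000
eta = math.sqrt(2.0 * math.log(K) / (K * T))  # a common textbook step size
pulls = exp3(make_bernoulli_losses([0.1, 0.7, 0.9]), K, T, eta)
```

In this plain form the sampling probability of a suboptimal arm can decay quickly, which is the kind of behavior the abstract's regularization is meant to control.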

Paper Structure

This paper contains 23 sections, 18 theorems, 101 equations, 4 figures, and 1 algorithm.

Key Result

Lemma 1

Given any stable algorithm $\mathcal{A}$, let $\widehat{\mu}_{a,T}(\mathcal{A})$ and $\widehat{\sigma}^2_{a,T}$, respectively, denote the sample mean and variance of arm $a$ losses (or rewards) at time $T$. Then for all $a \in [K]$

Figures (4)

  • Figure 1: Empirical behavior of Algorithm 1 for a Bernoulli bandit with $\mu=(0.9,0.3,0.1)^\top$ and $\alpha=1$: standardized estimation errors $\sqrt{n_{a,T}}(\widehat{\mu}_{a,T} - \mu_a)/\widehat{\sigma}_{a,T}$ are approximately standard normal.
  • Figure 2: Empirical behavior of Algorithm 1 for a Bernoulli bandit with $\mu=(0.9,0.3,0.1)^\top$: empirical coverage probabilities nearly align with the diagonal.
  • Figure 3: Empirical behavior of Algorithm 1 for a Bernoulli bandit with $\mu=(0.7,0.7,0.7)^\top$ and $\alpha=1/2$: the proportion of pulls concentrates around $1/3$ for each arm.
  • Figure 4: Empirical behavior of Algorithm 1 for a Bernoulli bandit with $\mu=(0.7,0.7,0.7)^\top$ and $\alpha=1/2$: standardized estimation errors $\sqrt{n_{a,T}}(\widehat{\mu}_{a,T} - \mu_a)/\widehat{\sigma}_{a,T}$ are approximately standard normal.
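
The standardized error $\sqrt{n_{a,T}}(\widehat{\mu}_{a,T} - \mu_a)/\widehat{\sigma}_{a,T}$ in the captions above, and the Wald-type interval it leads to, can be computed as in the sketch below. The helper names `standardized_error` and `wald_interval` are hypothetical, and the use of the $n-1$ denominator for the sample variance is an assumption, since the source does not state which convention it uses. The code only performs the arithmetic; whether the resulting interval has nominal coverage on adaptively collected data depends on the stability property discussed in the abstract.

```python
import math
from statistics import NormalDist


def _mean_and_std(samples):
    """Sample mean and standard deviation (n - 1 denominator, an assumption)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, math.sqrt(var)


def standardized_error(samples, true_mean):
    """sqrt(n) * (sample mean - true mean) / sample standard deviation."""
    mean, std = _mean_and_std(samples)
    return math.sqrt(len(samples)) * (mean - true_mean) / std


def wald_interval(samples, level=0.95):
    """Two-sided Wald interval: mean +/- z * std / sqrt(n)."""
    mean, std = _mean_and_std(samples)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # standard normal quantile
    half_width = z * std / math.sqrt(len(samples))
    return mean - half_width, mean + half_width


# Illustrative observations from a single arm (made-up data).
arm_losses = [1, 0, 1, 1, 0, 1, 0, 1]
lower, upper = wald_interval(arm_losses)
```

In a simulation, repeating this over many independent runs and plotting the standardized errors against a standard normal density is one way to produce a diagnostic of the kind the captions describe.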

Theorems & Definitions (36)

  • Definition 1
  • Lemma 1 (Lai and Wei, 1982, Theorem 3)
  • Lemma 2
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Definition 2
  • Lemma 3
  • proof
  • ...and 26 more