Identifying the Best Arm in the Presence of Global Environment Shifts

Phurinut Srisawad, Juergen Branke, Long Tran-Thanh

TL;DR

The paper addresses best-arm identification (BAI) under global environment shifts, where the mean reward of arm $i$ in environment $j$ is $\mu_{ij} = \mu_i + s_j$ (an arm-specific mean plus a shift shared by all arms) and environments are piecewise stationary. It reframes the problem as a regression task and introduces an OLS-based selection approach together with LinLUCB, an allocation policy that integrates regression uncertainty into a confidence bound. Key contributions include (i) an unbiased, tractable OLS estimator for arm means and environment shifts with higher-order covariance structure, (ii) a regression-informed LUCB-like allocation that enforces two distinct samples per environment, and (iii) extensive empirical evidence showing that LinLUCB outperforms standard policies and Reduce-to-MAB baselines across multiple non-stationary settings. The work demonstrates that exploiting the global-shift structure yields practical performance gains in non-stationary BAI and provides a foundation for future extensions that relax assumptions on shift patterns and noise heterogeneity.
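The regression view described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy data, and the convention of pinning the first environment's shift to zero (needed because the model is only identified up to an additive constant) are all assumptions made for illustration. The example is noise-free so that the effect of unbalanced sampling across environments is easy to see.

```python
import numpy as np


def fit_additive_shift_ols(arms, envs, rewards, n_arms, n_envs):
    """Least-squares fit of reward = mu_i + s_j + noise.

    Adding a constant to every mu_i and subtracting it from every s_j
    leaves the fit unchanged, so the shift of environment 0 is pinned
    to zero (an illustrative convention, not necessarily the paper's).
    """
    arms = np.asarray(arms)
    envs = np.asarray(envs)
    y = np.asarray(rewards, dtype=float)
    n = len(y)
    # One indicator column per arm, one per environment except env 0.
    X = np.zeros((n, n_arms + n_envs - 1))
    X[np.arange(n), arms] = 1.0
    later = envs > 0
    X[np.flatnonzero(later), n_arms + envs[later] - 1] = 1.0
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    mu_hat = coef[:n_arms]
    s_hat = np.concatenate(([0.0], coef[n_arms:]))
    return mu_hat, s_hat


def naive_arm_means(arms, rewards, n_arms):
    """Per-arm sample means that ignore the environment entirely."""
    arms = np.asarray(arms)
    y = np.asarray(rewards, dtype=float)
    return np.array([y[arms == i].mean() for i in range(n_arms)])


# Hypothetical toy instance: arm 2 is best, but arm 0 is sampled mostly
# in the environment with a large positive shift.
true_mu = np.array([0.0, 0.5, 1.0])
true_shift = np.array([0.0, 10.0, -5.0])
arms = np.array([0, 1, 0, 0, 0, 1, 1, 2, 2])
envs = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])
rewards = true_mu[arms] + true_shift[envs]  # noise-free for clarity

mu_hat, s_hat = fit_additive_shift_ols(arms, envs, rewards, 3, 3)
naive = naive_arm_means(arms, rewards, 3)
print("OLS arm estimates:", np.round(mu_hat, 3))
print("naive arm means:  ", np.round(naive, 3))
print("OLS pick:", int(np.argmax(mu_hat)), "| naive pick:", int(np.argmax(naive)))
```

In this toy instance every environment contains samples from at least two distinct arms and the environments overlap in the arms they share, which is what makes the design matrix full rank. The naive averages favour arm 0 only because it was sampled mostly under the large positive shift, while the least-squares fit separates arm effects from shift effects.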

Abstract

This paper formulates a new Best-Arm Identification problem in the non-stationary stochastic bandits setting, where the means of all arms are shifted in the same way due to a global influence of the environment. The aim is to identify the unique best arm across environmental changes given a fixed total budget. While this setting can be regarded as a special case of Adversarial Bandits or Corrupted Bandits, we demonstrate that existing solutions tailored to those settings do not fully utilise the nature of this global influence and thus do not work well in practice, despite their theoretical guarantees. To overcome this issue, we develop a novel selection policy that is consistent and robust in dealing with global environmental shifts. We then propose an allocation policy, LinLUCB, which exploits information about global shifts across all arms in each environment. Empirical tests show that our policies significantly outperform existing methods.

Paper Structure (17 sections, 1 theorem, 17 equations, 14 figures, 2 algorithms)

Key Result

Theorem 1

For any policy under which the OLS estimator is valid and all arms are sampled infinitely often, i.e., $N_i:=\sum_{j=1}^J n_{ij} \rightarrow \infty$ for all $i$, assume that $J \rightarrow \infty$ and there exist constants $v^*$, $w^*$ such that $0<v^*\leq\mathbb{V}[\hat{s}_j], \mathrm{Cov}(\hat{s}_j,\hat{s}_m) \ldots$

Figures (14)

  • Figure 1: Example of a policy sampling from arms on a BAI problem with global environment shifts.
  • Figure 2: PICS of existing algorithms over $10^5$ replications on a Gaussian configuration with 5 arms, where the gaps between consecutive ordered arms are equal ($\delta=0.5$) and arms have equal variance ($\sigma=1$). The lengths of environments $j$ are uniformly distributed, $\Delta cp_{j} \sim \tilde{\mathcal{U}}(2,50)$, and the shift is a random variable, $s_j \sim \mathcal{U}(0,20)$.
  • Figure 3: Representative graph structure illustrating how an allocation policy drives the evolution of the graph at the end of each environment.
  • Figure 4: The performance of LinLUCB and benchmark policies.
  • Figure 5: Comparison of Reduce-to-MAB strategies and their corresponding performance in a stationary environment.
  • ...and 9 more figures

Theorems & Definitions (2)

  • Theorem 1
  • proof