Fooling Algorithms in Non-Stationary Bandits using Belief Inertia

Gal Mendelson, Eyal Tadmor

TL;DR

The paper tackles worst-case regret in piecewise-stationary multi-armed bandits by introducing a belief inertia argument: an algorithm's empirical reward averages build up momentum that resists new evidence after a change. It shows that standard algorithms such as Explore-Then-Commit, $\varepsilon$-greedy, and UCB can be driven to regret that scales linearly with the horizon $T$, even with a single breakpoint and regardless of parameter tuning, by constructing adversarial instances that exploit this inertia. The results extend to periodically restarted algorithms, for which the worst-case regret is shown to remain linear in $T$. Overall, the work positions belief inertia as a tool for deriving sharp lower bounds in non-stationary bandits, in contrast to existing arguments based on infrequent sampling.

Abstract

We study the problem of worst-case regret in piecewise-stationary multi-armed bandits. While the minimax theory for stationary bandits is well established, understanding analogous limits in time-varying settings is challenging. Existing lower bounds rely on what we refer to as infrequent sampling arguments, where long intervals without exploration allow adversarial reward changes that induce large regret. In this paper, we introduce a fundamentally different approach based on a belief inertia argument. Our analysis captures how an algorithm's empirical beliefs, encoded through historical reward averages, create momentum that resists new evidence after a change. We show how this inertia can be exploited to construct adversarial instances that mislead classical algorithms such as Explore-Then-Commit, $\varepsilon$-greedy, and UCB, causing them to suffer regret that grows linearly with $T$ and with a substantial constant factor, regardless of how their parameters are tuned, even with a single change point. We extend the analysis to algorithms that periodically restart to handle non-stationarity and prove that, even then, the worst-case regret remains linear in $T$. Our results indicate that utilizing belief inertia can be a powerful method for deriving sharp lower bounds in non-stationary bandits.
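The inertia of a historical reward average can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the paper: the helper `pulls_to_cross` and its parameters are hypothetical names, and rewards are assumed deterministic. It counts how many post-change pulls of an arm are needed before its running average exceeds a given threshold, such as the average of a competing arm.

```python
from fractions import Fraction


def pulls_to_cross(n_before, mean_before, reward_after, threshold):
    """Count post-change pulls until the running average exceeds `threshold`.

    Hypothetical helper for illustration. Assumes deterministic rewards:
    the arm was pulled `n_before` times with average `mean_before`, and
    every pull after the change returns `reward_after`.
    """
    reward_after = Fraction(reward_after)
    threshold = Fraction(threshold)
    if reward_after <= threshold:
        # The average can never strictly exceed the threshold in this case.
        raise ValueError("reward_after must be greater than threshold")

    total = Fraction(mean_before) * n_before  # exact sum of past rewards
    n = n_before
    pulls = 0
    # Keep pulling while the empirical average has not crossed the threshold.
    while n == 0 or total / n <= threshold:
        total += reward_after
        n += 1
        pulls += 1
    return pulls


if __name__ == "__main__":
    # An arm with 100 zero-reward pulls needs 101 pulls at reward 1
    # before its average exceeds 0.5.
    print(pulls_to_cross(100, 0, 1, 0.5))
```

Under these assumptions, an arm with $n$ zero-reward pulls needs $n + 1$ pulls at reward 1 before its average exceeds $0.5$, so the longer the history, the slower the belief moves. An algorithm that rarely samples the arm after the change may take a long time to collect those pulls.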

Paper Structure

This paper contains 18 sections, 10 theorems, 62 equations, 3 figures.

Key Result

Lemma 1

For any algorithm $\pi$ and $\Gamma \in \mathbb{N}$,

Figures (3)

  • Figure 1: UCB indices in a two-armed bandit with deterministic rewards that switch from $(0, 0)$ to $(0.38,1)$ at round 191. Following the change, the suboptimal arm B experiences a sharp increase in its index, which remains higher than that of the optimal arm A for the remainder of the rounds.
  • Figure 2: Explore-Then-Commit in a two-armed bandit with deterministic rewards that switch from $(0, 1)$ to $(1,0)$ at round 21. The algorithm commits to the suboptimal arm B.
  • Figure 3: $\varepsilon$-greedy in a two-armed bandit with deterministic rewards that switch from $(0.5, 0)$ to $(0.5, 1)$ at round 200. After the change, the algorithm’s belief updates slowly, causing the newly optimal arm to remain underexplored for most of the remaining rounds.
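The Explore-Then-Commit scenario in Figure 2 can be reproduced with a short simulation. This is a minimal sketch under stated assumptions rather than the authors' experiment: the exploration length (10 round-robin pulls per arm), the horizon of 100 rounds, the convention that the new rewards apply from round 21 onward, and the function names `rewards_at` and `run_etc` are all assumptions made here. Regret is measured against the best arm in each round.

```python
def rewards_at(t, change_round=21):
    """Deterministic rewards (arm A, arm B) at 1-indexed round t.

    Assumes the switch from (0, 1) to (1, 0) takes effect at `change_round`.
    """
    return (0.0, 1.0) if t < change_round else (1.0, 0.0)


def run_etc(horizon, explore_per_arm, change_round=21):
    """Simulate Explore-Then-Commit on the two-armed instance above.

    Explores round-robin for 2 * explore_per_arm rounds (explore_per_arm
    must be at least 1), then commits to the arm with the higher empirical
    mean. Returns (committed_arm, regret), where arm 0 is A, arm 1 is B,
    and regret is summed against the per-round best arm.
    """
    sums = [0.0, 0.0]
    counts = [0, 0]
    committed = None
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= 2 * explore_per_arm:
            arm = (t - 1) % 2  # alternate A, B, A, B, ...
        else:
            if committed is None:
                means = [sums[i] / counts[i] for i in range(2)]
                committed = 0 if means[0] >= means[1] else 1
            arm = committed
        r = rewards_at(t, change_round)
        sums[arm] += r[arm]
        counts[arm] += 1
        regret += max(r) - r[arm]
    return committed, regret


if __name__ == "__main__":
    arm, regret = run_etc(horizon=100, explore_per_arm=10)
    print("committed to arm", "AB"[arm], "with regret", regret)
```

With these assumed parameters the exploration phase ends just before the change, so the algorithm commits to arm B on the basis of pre-change data and pays one unit of regret in every remaining round. The total regret is then the horizon minus the 10 exploration pulls that happened to land on the then-optimal arm, which grows linearly with the horizon.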

Theorems & Definitions (16)

  • Lemma 1
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Lemma 2
  • proof
  • Theorem 3
  • proof
  • Lemma 3
  • ...and 6 more