The Fragility of Optimized Bandit Algorithms

Lin Fan, Peter W. Glynn

TL;DR

The fragility of optimized bandit algorithms is studied for the first time, and the authors characterize sharp trade-offs between the expected regret rate and the heaviness of the regret tail.

Abstract

Much of the literature on optimal design of bandit algorithms is based on minimization of expected regret. It is well known that designs that are optimal over certain exponential families can achieve expected regret that grows logarithmically in the number of arm plays, at a rate governed by the Lai-Robbins lower bound. In this paper, we show that when one uses such optimized designs, the regret distribution of the associated algorithms necessarily has a very heavy tail, specifically, that of a truncated Cauchy distribution. Furthermore, for $p>1$, the $p$'th moment of the regret distribution grows much faster than poly-logarithmically, in particular as a power of the total number of arm plays. We show that optimized UCB bandit designs are also fragile in an additional sense, namely when the problem is even slightly mis-specified, the regret can grow much faster than the conventional theory suggests. Our arguments are based on standard change-of-measure ideas, and indicate that the most likely way that regret becomes larger than expected is when the optimal arm returns below-average rewards in the first few arm plays, thereby causing the algorithm to believe that the arm is sub-optimal. To alleviate the fragility issues exposed, we show that UCB algorithms can be modified so as to ensure a desired degree of robustness to mis-specification. In doing so, we also show a sharp trade-off between the amount of UCB exploration and the heaviness of the resulting regret distribution tail.
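The mechanism described in the abstract can be explored with a small simulation. The following is a minimal sketch, not the authors' code: it assumes a two-armed Gaussian bandit with means $(0.1, 0)$, one initial play of each arm, and the index $\hat\mu_i + \sqrt{2 f(t)/N_i}$ with $f(t) = (1+b)\log(t)$, which is what the KL-UCB index reduces to for unit-variance Gaussians, since $d(x,y) = (x-y)^2/2$. The function names `arm2_plays` and `tail_exponent_estimate`, the tie-breaking rule, and the default run count are illustrative assumptions.

```python
import math
import random


def arm2_plays(T, means=(0.1, 0.0), sigma0=1.0, b=0.0, rng=None):
    """Run one two-armed Gaussian bandit for T plays; return N_2(T)."""
    if rng is None:
        rng = random.Random()
    # Assumption: each arm is played once to initialise its sample mean.
    counts = [1, 1]
    sums = [rng.gauss(m, sigma0) for m in means]
    for t in range(3, T + 1):
        f_t = (1.0 + b) * math.log(t)  # exploration level f(t) = (1 + b) log t
        # Index for unit-variance Gaussian KL: mean + sqrt(2 f(t) / N).
        index = [
            sums[i] / counts[i] + math.sqrt(2.0 * f_t / counts[i])
            for i in range(2)
        ]
        arm = 0 if index[0] >= index[1] else 1  # ties go to arm 1 (assumption)
        sums[arm] += rng.gauss(means[arm], sigma0)
        counts[arm] += 1
    return counts[1]


def tail_exponent_estimate(T, frac=0.8, runs=2000, seed=0, **kwargs):
    """Estimate p = P(N_2(T) >= frac * T) and the ratio log(p) / log(T)."""
    rng = random.Random(seed)
    hits = sum(arm2_plays(T, rng=rng, **kwargs) >= frac * T for _ in range(runs))
    p = hits / runs
    exponent = math.log(p) / math.log(T) if p > 0 else float("-inf")
    return p, exponent
```

Passing `sigma0` different from 1 mimics the variance mis-specification studied in the paper, and `b > 0` adds exploration. Resolving the tail at large $T$ needs far more runs than the default here; the paper's figures report millions of runs per curve.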

Paper Structure

This paper contains 34 sections, 28 theorems, 214 equations, 5 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Let $\pi$ be $\mathcal{M}_P$-optimized. Then for any environment $\nu = (P^{\mu_1},\dots,P^{\mu_K}) \in \mathcal{M}_P^K$ and the $i$-th-best arm $r(i)$, with $B_\gamma(T) = [\log^{1+\gamma}(T), (1-\gamma)T]$ and any $\gamma \in (0,1)$. If in addition, $P$ is discrimination equivalent, then for the second-best arm $r(2)$, uniformly for $x \in [T^\gamma, (1-\gamma)T]$ for any $\gamma \in (0,1)$ as

Figures (5)

  • Figure 1: Plot of $\log \mathbb{P}_{\nu \pi}(N_2(T) \ge 0.8 T)/\log(T)$ vs $T$. Environment $\nu = (N(0.1,\sigma_0^2),N(0,\sigma_0^2))$. Algorithm $\pi$ is KL-UCB for iid unit-variance Gaussian rewards. The curves correspond to the cases $\sigma_0^2 = 1,1.5,\dots,4$, as indicated by the legend. The curves asymptote to $-1/\sigma_0^2$ in each case, which agrees with (\ref{varianceratio}) in Corollary \ref{cor1}. To generate each curve, $2 \times 10^6$ simulation runs were used.
  • Figure 2: Plot of $\log \mathbb{P}_{\nu \pi}(N_2(T) \ge 0.8 T)/\log(T)$ vs $T$. Environment $\nu$ consists of two Gaussian AR(1) processes with common AR coefficient $\beta_0$, and equilibrium distributions $(N(0.1,1),N(0,1))$. Algorithm $\pi$ is KL-UCB for iid unit-variance Gaussian rewards. The curves correspond to the cases $\beta_0 = 0,0.15,\dots,0.9$, as indicated by the legend. The curves approximately asymptote to $-(1-\beta_0)/(1+\beta_0)$, which agrees with the lower bound in Corollary \ref{cor2} and (\ref{arvarianceratio2}). To generate each curve, $2 \times 10^6$ simulation runs were used.
  • Figure 3: Plot of $\log \mathbb{P}_{\nu \pi}(N_2(T) > x)/\log(x)$ vs $x$ for $x \in [0.05 T, 0.95 T]$ (with time horizon $T$ fixed). Environment $\nu = (\text{Ber}(q),\text{Ber}(0.4))$. Algorithm $\pi$ is KL-UCB for iid Bernoulli rewards. Top: $q = 0.475$, $T = 10^4$; Middle: $q = 0.5$, $T = 5 \times 10^3$; Bottom: $q = 0.525$, $T = 3.4 \times 10^3$. Each curve asymptotes to $-\lim_{z \downarrow 0} d_P(z,q)/d_P(z,0.4)$ (with values $-1.26$ (top), $-1.36$ (middle), $-1.46$ (bottom)), as specified by Theorem \ref{generalupperbound} and (\ref{klequivalence}). To generate each curve, $8 \times 10^6$ simulation runs were used.
  • Figure 4: Plot of $\log \mathbb{P}_{\nu \pi}(N_2(T) > x)/\log(x)$ vs $x$ for $x \in [0.05 T, 0.95 T]$, with fixed time horizon $T = 7 \times 10^3$. Environment $\nu = (N(0.1,1),N(0,1))$. $\pi$ is Algorithm \ref{alg1} with KL divergence $d_P$ between unit-variance Gaussian distributions, and $f(t) = (1+b)\log(t)$ (to aim for a regret tail exponent of $-(1+b)$). The curves correspond to the cases $b = 0,0.25,0.5,0.75$, as indicated by the legend. As predicted by (\ref{psitail1}) in Proposition \ref{prop5}, the curves asymptote to $-1$, $-1.25$, $-1.5$, $-1.75$. To generate each curve, $4 \times 10^7$ simulation runs were used.
  • Figure 5: Plot of $\log \mathbb{P}_{\nu \pi}(N_2(T) > x)/\log(x)$ vs $x$ for $x \in [0.05 T, 0.95 T]$, with fixed time horizon $T = 10^4$. Environment $\nu$ consists of two Gaussian AR(1) processes with common AR coefficient $\beta_0$, and equilibrium distributions $(N(0.1,1),N(0,1))$. $\pi$ is Algorithm \ref{alg1} with KL divergence $d_P$ between unit-variance Gaussian distributions, and $f(t) = (1+b)\log(t)$ with $1+b = 1.1 \cdot \frac{1+\beta_0}{1-\beta_0}$ (to aim for a regret tail exponent of $\approx -1.1$ in each case of $\beta_0$). The curves correspond to the cases $\beta_0 = 0,0.15,0.3,0.45$, as indicated by the legend. All curves asymptote to (slightly less than) $-1.1$, as desired. To generate each curve, $4 \times 10^7$ simulation runs were used.
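
The numerical values quoted in the Figure 3 and Figure 5 captions can be reproduced with a few lines of arithmetic. This is a minimal sketch under stated assumptions: $d_P$ is taken to be the Bernoulli KL divergence $d_P(z,q) = z\log(z/q) + (1-z)\log((1-z)/(1-q))$, whose limit as $z \downarrow 0$ is $-\log(1-q)$, and the helper names `bernoulli_kl`, `bernoulli_limit_ratio`, and `explore_level` are illustrative, not from the paper.

```python
import math


def bernoulli_kl(z, q):
    """KL divergence between Ber(z) and Ber(q), for 0 <= z < 1 and 0 < q < 1."""
    first = 0.0 if z == 0.0 else z * math.log(z / q)
    return first + (1.0 - z) * math.log((1.0 - z) / (1.0 - q))


def bernoulli_limit_ratio(q, q2=0.4):
    """Limit of d_P(z, q) / d_P(z, q2) as z decreases to 0."""
    return math.log(1.0 - q) / math.log(1.0 - q2)


def explore_level(beta0, target=1.1):
    """Value of 1 + b from the Figure 5 caption: target * (1 + beta0) / (1 - beta0)."""
    return target * (1.0 + beta0) / (1.0 - beta0)


# Figure 3: the tail exponents are the negatives of the limiting ratios.
for q in (0.475, 0.5, 0.525):
    print(q, round(-bernoulli_limit_ratio(q), 2))  # -1.26, -1.36, -1.46

# Figure 5: exploration level 1 + b for each AR coefficient beta0.
for beta0 in (0.0, 0.15, 0.3, 0.45):
    print(beta0, explore_level(beta0))
```

The Figure 5 choice follows from the Figure 2 caption: with $b = 0$ the exponent under AR(1) rewards is about $-(1-\beta_0)/(1+\beta_0)$, so scaling $f(t)$ by $1.1 \cdot \frac{1+\beta_0}{1-\beta_0}$ aims for an exponent of about $-1.1$.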

Theorems & Definitions (50)

  • Definition 1: $\mathcal{M}_P$-Consistent Algorithm
  • Definition 2: $\mathcal{M}_P$-Optimized Algorithm
  • Definition 3: Discrimination Equivalence
  • Theorem 1
  • Lemma 1
  • Proposition 1
  • Proposition 2
  • Example 1
  • Example 2
  • Example 3
  • ...and 40 more