Empirical Bound Information-Directed Sampling for Norm-Agnostic Bandits

Piotr M. Suder, Eric Laber

TL;DR

This work addresses the sensitivity of Information-Directed Sampling (IDS) to an a priori bound on the parameter norm in linear bandits with heteroskedastic noise. It introduces EBIDS, an algorithm that iteratively refines a high-probability upper bound on $B^* = \|\boldsymbol{\theta}^*\|_2$ as data accumulate. EBIDS uses a bound-action mixture (BAM) that combines bound-improvement information $I_t^B$ with model information $I_t^{\text{EB-UCB}}$ during an initial bound-exploration phase, followed by a bound-exploitation phase that relies on the refined bound to achieve sublinear regret. The theoretical guarantees give regret and pseudo-regret bounds that eventually become independent of the initial bound $B$, and simulations show that EBIDS outperforms competing norm-agnostic approaches while approaching oracle performance in many settings. The approach offers a general design principle for balancing bound refinement against regret minimization, and may apply to IDS/UCB frameworks beyond the linear, heteroskedastic bandit setting.

Abstract

Information-directed sampling (IDS) is a powerful framework for solving bandit problems which has shown strong results in both Bayesian and frequentist settings. However, frequentist IDS, like many other bandit algorithms, requires that one have prior knowledge of a (relatively) tight upper bound on the norm of the true parameter vector governing the reward model in order to achieve good performance. Unfortunately, this requirement is rarely satisfied in practice. As we demonstrate, using a poorly calibrated bound can lead to significant regret accumulation. To address this issue, we introduce a novel frequentist IDS algorithm that iteratively refines a high-probability upper bound on the true parameter norm using accumulating data. We focus on the linear bandit setting with heteroskedastic subgaussian noise. Our method leverages a mixture of relevant information gain criteria to balance exploration aimed at tightening the estimated parameter norm bound and directly searching for the optimal action. We establish regret bounds for our algorithm that do not depend on an initially assumed parameter norm bound and demonstrate that our method outperforms state-of-the-art IDS and UCB algorithms.
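To make the role of the norm bound concrete, the following is a minimal sketch of where an assumed bound $B$ enters a standard OFUL-style confidence radius for a linear bandit. It is not the paper's EBIDS algorithm: it assumes homoskedastic $R$-subgaussian noise for simplicity (the paper treats heteroskedastic noise), and the function names `confidence_radius` and `ucb_scores`, as well as the parameters `lam`, `R`, and `delta`, are illustrative choices rather than anything taken from the paper.

```python
import numpy as np


def confidence_radius(V, lam, B, R=1.0, delta=0.05):
    """OFUL-style radius (assumed form, homoskedastic noise):
    R * sqrt(2 * log(det(V)^{1/2} / (det(lam * I)^{1/2} * delta))) + sqrt(lam) * B.
    V is the regularized design matrix lam * I + sum_s a_s a_s^T.
    """
    d = V.shape[0]
    _, logdet_V = np.linalg.slogdet(V)
    # log of det(V)^{1/2} / (det(lam * I)^{1/2} * delta)
    log_ratio = 0.5 * logdet_V - 0.5 * d * np.log(lam) - np.log(delta)
    return R * np.sqrt(2.0 * log_ratio) + np.sqrt(lam) * B


def ucb_scores(actions, theta_hat, V, radius):
    """Upper confidence bound <a, theta_hat> + radius * ||a||_{V^{-1}} per action row."""
    V_inv = np.linalg.inv(V)
    # Quadratic form a^T V^{-1} a for every row a of `actions`
    widths = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    return actions @ theta_hat + radius * widths


# Hypothetical example: three unit-vector actions, one of which was played once.
lam = 1.0
V = lam * np.eye(3) + np.outer([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
radius_tight = confidence_radius(V, lam, B=1.0)
radius_loose = confidence_radius(V, lam, B=100.0)
scores_tight = ucb_scores(np.eye(3), np.zeros(3), V, radius_tight)
scores_loose = ucb_scores(np.eye(3), np.zeros(3), V, radius_loose)
```

In this form the bound contributes an additive term $\sqrt{\lambda}\,B$ to the radius, so a conservative $B$ inflates every exploration bonus by the same factor regardless of how much data has been collected. This is one way to see why a poorly calibrated bound can slow learning.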

Paper Structure

This paper contains 18 sections, 8 theorems, 160 equations, 4 figures, 2 algorithms.

Key Result

Theorem 1

For any $T$, let $G$ be a fixed subset of $\{1, \ldots , T\}$ and let $\{A_t\}_{t = 1}^T$ be an $\boldsymbol{H}_t$-adapted sequence in $\mathcal{A}$. Then […], and if $\widehat{\Delta}_t(A_t) \geq \Delta(A_t)$ for all $t \in G$, then with probability $1$ we have […].

Figures (4)

  • Figure 1: Regret incurred by IDS-UCB and UCB with: (a) conservative $B=100$; (b) anti-conservative $B=1$. In both plots we include the oracle versions of IDS-UCB and UCB using $B=B^*$ for reference; note, however, that they cannot be implemented in most practical settings. The solid and dashed lines represent the regret averaged over $200$ repeated experiments, while the shaded regions are $95\%$ pointwise confidence bands.
  • Figure 2: Regret incurred by EBIDS, EB-UCB, NAOFUL, OLSOFUL, IDS-UCB and UCB with conservative $B=100$. We include the oracle versions of EBIDS, IDS-UCB and UCB using $B=B^*$ for reference. The solid and dashed lines represent the regret averaged over $200$ repeated experiments, while the shaded regions are $95\%$ pointwise confidence bands.
  • Figure 3: Regret for EBIDS averaged over $200$ repeated experiments with $T=500$ steps under different values of the tuning parameter $\alpha$ and the length $T_B$ of the bound-exploration phase.
  • Figure 4: Regret incurred by EBIDS, EB-UCB, NAOFUL, OLSOFUL, IDS-UCB and UCB using conservative $B=100$ for simulation settings (a)-(d) outlined above. We include the oracle versions of EBIDS, IDS-UCB and UCB using $B=B^*$ for reference. The solid and dashed lines represent the regret averaged over $200$ repeated experiments, while the shaded regions are $95\%$ pointwise confidence bands.
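The captions above report regret averaged over repeated experiments together with $95\%$ pointwise confidence bands. The source does not say how the bands were computed; the sketch below shows one common construction, assuming a normal approximation (mean plus or minus $1.96$ standard errors at each time step). The function name `mean_with_pointwise_band` and the synthetic data are hypothetical placeholders, not results from the paper.

```python
import numpy as np


def mean_with_pointwise_band(regret_runs, z=1.96):
    """Return (mean, lower, upper) per time step for an array of shape (n_runs, T).

    Assumes a normal approximation: band = mean +/- z * standard error,
    computed independently at each time step (hence "pointwise").
    """
    runs = np.asarray(regret_runs, dtype=float)
    n_runs = runs.shape[0]
    mean = runs.mean(axis=0)
    # Sample standard deviation (ddof=1) divided by sqrt(n) gives the standard error
    sem = runs.std(axis=0, ddof=1) / np.sqrt(n_runs)
    return mean, mean - z * sem, mean + z * sem


# Synthetic placeholder data: 200 runs of cumulative regret over 500 steps.
rng = np.random.default_rng(0)
fake_runs = np.cumsum(rng.uniform(0.0, 1.0, size=(200, 500)), axis=1)
mean_curve, lower_band, upper_band = mean_with_pointwise_band(fake_runs)
```

Pointwise bands cover the mean at each individual time step but do not give simultaneous coverage over the whole curve, which is worth keeping in mind when comparing algorithms whose bands overlap at only a few steps.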

Theorems & Definitions (8)

  • Theorem 1: Kirschner
  • Theorem 2
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Lemma 1
  • Lemma 2
  • Lemma 3