Infrequent Exploration in Linear Bandits

Harin Lee, Min-hwan Oh

TL;DR

INFEX is a practical framework for infrequent exploration in linear bandits: it runs a base exploratory policy on a given schedule and takes greedy actions in between. Theoretical results show that, when the exploration schedule satisfies $f(t)=\omega(\log t)$, the regret remains polylogarithmic and closely matches the base algorithm’s regret up to a schedule-dependent constant, while reducing costly exploration computations. The bounds separate the contribution of the base exploration steps from that of the greedy phases and rely on standard linear-bandit tools such as ridge estimation and the elliptical potential argument; a lower bound shows that the $\omega(\log t)$ condition is necessary for polylogarithmic regret. Empirical evaluations with LinUCB and LinTS as base algorithms confirm state-of-the-art regret performance and notable runtime improvements under various periodic exploration schedules. Because the framework is modular, any fully adaptive exploratory method can be plugged in, which makes it relevant to safety-critical or costly domains where frequent exploration is undesirable.

Abstract

We study the problem of infrequent exploration in linear bandits, addressing a significant yet overlooked gap between fully adaptive exploratory methods (e.g., UCB and Thompson Sampling), which explore potentially at every time step, and purely greedy approaches, which require stringent diversity assumptions to succeed. Continuous exploration can be impractical or unethical in safety-critical or costly domains, while purely greedy strategies typically fail without adequate contextual diversity. To bridge these extremes, we introduce a simple and practical framework, INFEX, explicitly designed for infrequent exploration. INFEX executes a base exploratory policy according to a given schedule while predominantly choosing greedy actions in between. Despite its simplicity, our theoretical analysis demonstrates that INFEX achieves instance-dependent regret matching standard provably efficient algorithms, provided the exploration frequency exceeds a logarithmic threshold. Additionally, INFEX is a general, modular framework that allows seamless integration of any fully adaptive exploration method, enabling wide applicability and ease of adoption. By restricting intensive exploratory computations to infrequent intervals, our approach can also enhance computational efficiency. Empirical evaluations confirm our theoretical findings, showing state-of-the-art regret performance and runtime improvements over existing methods.
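
The interleaving described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a fixed arm set, a LinUCB-style rule as the base exploratory policy, a periodic schedule, and a single ridge estimate shared by both phases (the paper may keep the base algorithm's statistics separate). The names `run_infex`, `periodic_schedule`, `beta`, `lam`, and `noise_sd` are hypothetical and chosen for this example only.

```python
import numpy as np


def periodic_schedule(period):
    """Hypothetical schedule: explore at t = 1, 1 + period, 1 + 2 * period, ..."""
    return lambda t: (t - 1) % period == 0


def linucb_action(arms, V, theta_hat, beta):
    """Base exploratory policy (LinUCB-style): estimated mean plus a confidence width."""
    V_inv = np.linalg.inv(V)
    # Squared width x^T V^{-1} x for every arm; clip tiny negatives from round-off.
    sq_widths = np.einsum("kd,de,ke->k", arms, V_inv, arms)
    widths = np.sqrt(np.maximum(sq_widths, 0.0))
    return int(np.argmax(arms @ theta_hat + beta * widths))


def greedy_action(arms, theta_hat):
    """Greedy policy: pick the arm with the highest estimated mean reward."""
    return int(np.argmax(arms @ theta_hat))


def run_infex(arms, theta_star, T, is_explore_step,
              lam=1.0, beta=1.0, noise_sd=0.1, seed=0):
    """Interleave the base policy (on scheduled steps) with greedy actions.

    Assumption: one ridge estimate, updated with every observation, is shared
    by the exploratory and greedy steps.
    Returns (cumulative expected regret, number of exploratory steps).
    """
    rng = np.random.default_rng(seed)
    d = arms.shape[1]
    V = lam * np.eye(d)          # regularized Gram matrix
    b = np.zeros(d)              # running sum of reward-weighted features
    best_mean = np.max(arms @ theta_star)
    regret, n_explore = 0.0, 0

    for t in range(1, T + 1):
        theta_hat = np.linalg.solve(V, b)   # ridge estimate
        if is_explore_step(t):
            a = linucb_action(arms, V, theta_hat, beta)
            n_explore += 1
        else:
            a = greedy_action(arms, theta_hat)

        x = arms[a]
        reward = x @ theta_star + noise_sd * rng.standard_normal()
        V += np.outer(x, x)
        b += reward * x
        regret += best_mean - x @ theta_star

    return regret, n_explore
```

For example, `run_infex(arms, theta_star, T=1000, is_explore_step=periodic_schedule(10))` calls the confidence-width computation on only one step in ten; the remaining steps need just a ridge solve and an argmax, which is where the runtime savings mentioned in the abstract would come from in this sketch.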

Paper Structure

This paper contains 35 sections, 17 theorems, 72 equations, 3 figures, 2 tables, 2 algorithms.

Key Result

Theorem 1

Let $\mathsf{Alg}$ be a linear bandit algorithm that attains polylogarithmic regret, specifically $\mathcal{R}_{\mathsf{Alg}}(T) = \mathcal{O} \left( \frac{d^a}{\Delta^b} \log^c T \right)$ with probability at least $1 - 1 / T$ for some constants $a, b, c \ge 0$. Let ${\mathcal{T}}_e \subset$ [...] where $G_{\text{const}}(\tau_{\mathsf{Alg}}, f)$ is independent of $T$, $\tau_{\mathsf{Alg}} \in$ [...]

Figures (3)

  • Figure 1: Comparison of total regret (left) and computation time (right) when $d = 10$, $T = 10000$, and $K = 10$ (top), $K = 100$ (middle), and $K = 1000$ (bottom).
  • Figure 2: Comparison of total regret (left) and computation time (right) when $d = 20$, $T = 10000$, and $K = 10$ (top), $K = 100$ (middle), and $K = 1000$ (bottom).
  • Figure 3: Comparison of total regret (left) and computation time (right) when $d = 40$, $T = 10000$, and $K = 10$ (top), $K = 100$ (middle), and $K = 1000$ (bottom).

Theorems & Definitions (24)

  • Remark 1: Substituting the ridge estimator.
  • Theorem 1: Regret of $\texttt{INFEX}$
  • Theorem 2
  • Theorem 3
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Proposition 1
  • Lemma 4
  • Remark 2
  • ...and 14 more