Minimum Empirical Divergence for Sub-Gaussian Linear Bandits

Kapilan Balagopalan, Kwang-Sung Jun

TL;DR

LinMED introduces a linear-MED framework for sub-Gaussian linear bandits that yields closed-form sampling probabilities via optimal experimental design, making it suitable for offline evaluation. It achieves near-optimal minimax regret $\tilde{O}(d\sqrt{n})$ and a refined instance-dependent bound that scales with the smallest gap $\Delta$, while remaining robust to under- or over-specification of the noise variance $\sigma_*^2$ and norm $\|\theta^*\|$. Theoretical guarantees are complemented by lower-bound results showing that competitors like EXP2 and SpannerIGW can incur $\Omega(\Delta\sqrt{n})$ regret in some instances, highlighting LinMED's advantage. Empirically, LinMED performs competitively across delayed rewards, offline evaluation, and high-dimensional settings, with variants tuned to balance exploration and exploitation. The work offers practical online learning that is friendly to off-policy evaluation (OPE), with a principled design-based sampling mechanism, and opens avenues for extensions to broader models and offline-benchmarking contexts.

Abstract

We propose a novel linear bandit algorithm called LinMED (Linear Minimum Empirical Divergence), which is a linear extension of the MED algorithm that was originally designed for multi-armed bandits. LinMED is a randomized algorithm that admits a closed-form computation of the arm sampling probabilities, unlike the popular randomized algorithm called linear Thompson sampling. Such a feature proves useful for off-policy evaluation, where unbiased evaluation requires accurately computing the sampling probability. We prove that LinMED enjoys a near-optimal regret bound of $d\sqrt{n}$ up to logarithmic factors, where $d$ is the dimension and $n$ is the time horizon. We further show that LinMED enjoys a $\frac{d^2}{\Delta}\log^2(n)\log(\log(n))$ problem-dependent regret, where $\Delta$ is the smallest sub-optimality gap. Our empirical study shows that LinMED is competitive with state-of-the-art algorithms.
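To illustrate why exact sampling probabilities matter for off-policy evaluation, the following is a minimal sketch of a standard inverse-propensity-weighted (IPW) estimator. It is not code from the paper: the function name `ipw_value`, the fixed logging distribution, the arm means, and the noiseless rewards are all assumptions chosen to keep the toy example small.

```python
import numpy as np


def ipw_value(rewards, actions, logging_probs, target_probs):
    """IPW estimate of a target policy's expected reward from logged data.

    rewards[t]       : reward observed at round t
    actions[t]       : index of the arm the logging policy pulled at round t
    logging_probs[t] : length-K sampling distribution of the logging policy
    target_probs[t]  : length-K sampling distribution of the target policy
    """
    rewards = np.asarray(rewards, dtype=float)
    actions = np.asarray(actions, dtype=int)
    logging_probs = np.asarray(logging_probs, dtype=float)
    target_probs = np.asarray(target_probs, dtype=float)
    rounds = np.arange(len(actions))
    # Reweight each logged reward by (target probability) / (logging probability).
    weights = target_probs[rounds, actions] / logging_probs[rounds, actions]
    return float(np.mean(weights * rewards))


# Toy setup (hypothetical): 3 arms, a fixed non-uniform logging policy,
# and the uniform policy as the evaluation target.
rng = np.random.default_rng(0)
means = np.array([1.0, 0.5, 0.2])
n = 20000
logging = np.tile([0.7, 0.2, 0.1], (n, 1))
target = np.full((n, 3), 1.0 / 3.0)
actions = rng.choice(3, size=n, p=logging[0])
rewards = means[actions]  # noiseless rewards, for simplicity only

estimate = ipw_value(rewards, actions, logging, target)
oracle = means.mean()  # true expected reward of the uniform policy
print(f"IPW estimate: {estimate:.3f}, oracle: {oracle:.3f}")
```

The estimate is unbiased only when `logging_probs` holds the true probabilities with which the logged arms were drawn. That is the quantity a closed-form sampling rule provides directly, and the one a posterior-sampling method would have to approximate.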

Paper Structure

This paper contains 34 sections, 21 theorems, 181 equations, 15 figures, 2 tables, 5 algorithms.

Key Result

Theorem 1

Under Assumptions main-assump:env-assumption, main-assump:opt-lev-scr-assumption, and main-assump:opt-cardinality-assumption, with $\delta_t = \frac{1}{t+1}$, LinMED satisfies, $\forall n \geq 1$,

Figures (15)

  • Figure 1: IPW scores of the uniform policy when the logging policy is LinMED and LinTS, respectively. We used 1,000 Monte Carlo samples to estimate the sampling probabilities of LinTS. Oracle denotes the expected reward of the uniform policy. LinTS shows a nontrivial amount of bias, unlike LinMED (the mean of LinMED is exactly aligned with the oracle and is thus invisible in the plot). See the appendix subsection on the offline evaluation experiments for details.
  • Figure 2: LinMED vs LinMEDNOPT, with $\sigma^2 = \sigma_*^2 = 3$ for $K \in \{4,8,16,32,64\}$, and $(\alpha_{\text{emp}},\alpha_{\text{opt}}) \in \{(0.99,0.005),(0.90,0.05),(0.5,0.25)\}$, $n = 20000$.
  • Figure 3: Large gap instance experiments
  • Figure 4: End of optimism experiments
  • Figure 5: End of optimism experiments $\varepsilon = 0.005$
  • ...and 10 more figures
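
The Monte Carlo step mentioned in the Figure 1 caption can be sketched as follows. This is a generic posterior-sampling illustration, not the paper's LinTS implementation: the Gaussian posterior `N(theta_hat, cov)`, the function name `lints_probs_monte_carlo`, and the toy arm set are assumptions made for the example.

```python
import numpy as np


def lints_probs_monte_carlo(arms, theta_hat, cov, num_samples=1000, rng=None):
    """Estimate each arm's probability of being the argmax under sampled parameters.

    arms      : (K, d) array of arm feature vectors
    theta_hat : (d,) mean of the assumed Gaussian posterior
    cov       : (d, d) covariance of the assumed Gaussian posterior
    """
    rng = np.random.default_rng() if rng is None else rng
    arms = np.asarray(arms, dtype=float)
    # Draw parameter samples and record which arm each sample prefers.
    thetas = rng.multivariate_normal(theta_hat, cov, size=num_samples)
    best = np.argmax(thetas @ arms.T, axis=1)
    counts = np.bincount(best, minlength=arms.shape[0])
    return counts / num_samples


# Toy arm set (hypothetical): the posterior strongly favors the first arm,
# so the other arms are selected rarely or never in a finite sample.
arms = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
theta_hat = np.array([1.0, 0.0])
cov = 0.05 * np.eye(2)

for seed in (0, 1):
    probs = lints_probs_monte_carlo(
        arms, theta_hat, cov, num_samples=1000, rng=np.random.default_rng(seed)
    )
    print(f"seed {seed}: estimated probabilities {probs}")
```

Estimates obtained this way vary from run to run, and a rarely chosen arm can receive an estimate of exactly zero. Either effect distorts the `1 / p` weights used by an IPW estimator, which is consistent with the bias the caption reports for LinTS.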

Theorems & Definitions (46)

  • Theorem 1: Instance-dependent bound
  • Theorem 2: Minimax bound
  • Corollary 3: Instance-dependent bound
  • Corollary 4: Minimax bound
  • Corollary 5: Minimax bound
  • Theorem 6
  • Theorem 7
  • Lemma 1: Regret Bound
  • proof
  • proof
  • ...and 36 more