Indexed Minimum Empirical Divergence-Based Algorithms for Linear Bandits

Jie Bian; Vincent Y. F. Tan

TL;DR

The paper extends the Indexed Minimum Empirical Divergence (IMED) framework to stochastic linear bandits with varying arm sets by introducing the LinIMED family (LinIMED-1/2/3) and the SupLinIMED variant for finite arm sets. It establishes regret guarantees, achieving $\widetilde{O}(d\sqrt{T})$ for LinIMED and $\widetilde{O}(\sqrt{dT})$ for SupLinIMED, with LinIMED-3 matching the $O(d\sqrt{T}\log T)$ rate of LinUCB/OFUL under standard assumptions. The authors connect LinIMED to Information Directed Sampling (IDS) through the index structure, while maintaining computational efficiency via a deterministic design and a decoupled exploitation term. Empirical results on synthetic data and the MovieLens dataset show LinIMED variants frequently outperform LinUCB and Linear Thompson Sampling, particularly in regimes where exploration-exploitation balance benefits from the squared-gap-based index. Together, these contributions advance robust, efficient learning in contextual linear bandits and offer practical alternatives to established linear bandit methods.

Abstract

The Indexed Minimum Empirical Divergence (IMED) algorithm is a highly effective approach for the multi-armed bandit problem that offers a stronger theoretical guarantee of asymptotic optimality than the Kullback--Leibler Upper Confidence Bound (KL-UCB) algorithm. It has also been observed to empirically outperform UCB-based algorithms and Thompson Sampling. Despite its effectiveness, the generalization of this algorithm to contextual bandits with linear payoffs has remained elusive. In this paper, we present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms. We demonstrate that LinIMED achieves a $\widetilde{O}(d\sqrt{T})$ regret upper bound, where $d$ is the dimension of the context and $T$ is the time horizon. Furthermore, extensive empirical studies reveal that LinIMED and its variants outperform widely used linear bandit algorithms such as LinUCB and Linear Thompson Sampling in some regimes.
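This summary does not reproduce the exact LinIMED index definitions, so the following is only a minimal sketch of what a deterministic, squared-gap-based index rule for a linear bandit could look like. Everything in it is an assumption made for illustration: the index form (squared empirical gap divided by a scaled uncertainty width, minus the log of that width), the choice of pulling the arm with the smallest index, the confidence scaling `beta = 1 + log(t)`, and the toy arm set and parameter `theta_star`. It should not be read as the authors' algorithm.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(M, v):
    return [dot(row, v) for row in M]


def sherman_morrison(V_inv, x):
    """Return (V + x x^T)^{-1} given V^{-1}, for symmetric positive definite V."""
    Vx = mat_vec(V_inv, x)
    denom = 1.0 + dot(x, Vx)
    d = len(x)
    return [[V_inv[i][j] - Vx[i] * Vx[j] / denom for j in range(d)]
            for i in range(d)]


def imed_style_indices(arms, theta_hat, V_inv, beta):
    """Hypothetical index: gap^2 / w - log(w), with w = beta * ||x||^2_{V^{-1}}.

    For the empirically best arm the gap is zero, so its index is -log(w).
    """
    estimates = [dot(theta_hat, x) for x in arms]
    best = max(estimates)
    indices = []
    for x, mu in zip(arms, estimates):
        w = beta * dot(x, mat_vec(V_inv, x))  # scaled uncertainty width
        gap = best - mu                       # empirical sub-optimality gap
        indices.append(gap * gap / w - math.log(w))
    return indices


def run(T=2000, seed=0, noise=0.1, lam=1.0):
    """Simulate the sketch on a toy 2-d instance and return cumulative regret."""
    rng = random.Random(seed)
    theta_star = [0.8, 0.2]                       # assumed unknown parameter
    arms = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]   # assumed fixed arm set
    d = len(theta_star)
    V_inv = [[(1.0 / lam) if i == j else 0.0 for j in range(d)]
             for i in range(d)]
    b = [0.0] * d
    best_mean = max(dot(theta_star, x) for x in arms)
    regret = 0.0
    for t in range(1, T + 1):
        theta_hat = mat_vec(V_inv, b)             # ridge-regression estimate
        beta = 1.0 + math.log(t)                  # assumed confidence scaling
        idx = imed_style_indices(arms, theta_hat, V_inv, beta)
        a = min(range(len(arms)), key=idx.__getitem__)  # deterministic choice
        x = arms[a]
        reward = dot(theta_star, x) + rng.gauss(0.0, noise)
        V_inv = sherman_morrison(V_inv, x)
        b = [bi + reward * xi for bi, xi in zip(b, x)]
        regret += best_mean - dot(theta_star, x)
    return regret
```

Calling `run()` returns the cumulative pseudo-regret on the toy instance. The only randomness is in the simulated rewards; given the observed data the arm choice is deterministic, which is the property the TL;DR highlights in contrast to sampling-based methods.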

Paper Structure (38 sections, 15 theorems, 139 equations, 4 figures, 9 tables, 3 algorithms)

This paper contains 38 sections, 15 theorems, 139 equations, 4 figures, 9 tables, 3 algorithms.

Introduction
Motivation and Related Work
Problem Statement
Notations:
The Stochastic Linear Bandit Model:
Description of LinIMED Algorithms
Description of the SupLinIMED Algorithm
Relation to the IMED algorithm of
Relation to Information Directed Sampling (IDS) for Linear Bandits
Theorem Statements
Proof Sketch of Theorem 1 (LinIMED-1)
Empirical Studies
Experiments on a Synthetic Dataset in the Varying Arm Set Setting
Experiments on the "End of Optimism" instance
Experiments on the MovieLens Dataset
...and 23 more sections

Key Result

Theorem 1

Under Assumption 1, the assumption that $\langle \theta^*, x_{t,a}\rangle \ge 0$ for all $t\ge1$ and $a\in \mathcal{A}_t$, and the assumption that $\sqrt{\lambda}S\ge 1$, the regret of the LinIMED-1 algorithm is upper bounded as follows:

Figures (4)

Figure 1: Simulation results (expected regrets) on the synthetic dataset with different $K$'s
Figure 2: Simulation results (expected regrets) on the synthetic dataset with different $d$'s
Figure 3: Simulation results (expected regrets) on the "End of Optimism" instance with different $\varepsilon$'s
Figure 4: Simulation results (CTRs) of the MovieLens dataset with different $K$'s

Theorems & Definitions (28)

Theorem 1
Theorem 2
Theorem 3
Theorem 4
Lemma 1
Corollary 1
proof
Lemma 2
Lemma 3
Lemma 4
...and 18 more
