Minimum Empirical Divergence for Sub-Gaussian Linear Bandits
Kapilan Balagopalan, Kwang-Sung Jun
TL;DR
LinMED extends the MED (Minimum Empirical Divergence) algorithm to sub-Gaussian linear bandits, yielding closed-form sampling probabilities via optimal experimental design, which makes it suitable for off-policy evaluation (OPE). It achieves near-optimal minimax regret $\tilde{O}(d\sqrt{n})$ and a refined instance-dependent bound that scales inversely with the smallest gap $\Delta$, while remaining robust to under- or over-specification of the noise variance $\sigma_*^2$ and norm $\|\theta^*\|$. Theoretical guarantees are complemented by lower-bound results showing competitors like EXP2 and SpannerIGW can incur $\Omega(\Delta\sqrt{n})$ regret in some instances, highlighting LinMED's advantage. Empirically, LinMED performs competitively across delayed rewards, offline evaluation, and high-dimensional settings, with variants tuned to balance exploration and exploitation. The work offers practical OPE-friendly online learning with a principled design-based sampling mechanism and opens avenues for extensions to broader models and offline-benchmarking contexts.
Abstract
We propose a novel linear bandit algorithm called LinMED (Linear Minimum Empirical Divergence), a linear extension of the MED algorithm that was originally designed for multi-armed bandits. LinMED is a randomized algorithm that admits a closed-form computation of the arm sampling probabilities, unlike the popular randomized algorithm linear Thompson sampling. This feature is useful for off-policy evaluation, where unbiased evaluation requires accurately computing the sampling probabilities. We prove that LinMED enjoys a near-optimal regret bound of $d\sqrt{n}$ up to logarithmic factors, where $d$ is the dimension and $n$ is the time horizon. We further show that LinMED enjoys a problem-dependent regret bound of $\frac{d^2}{\Delta}\log^2(n)\log(\log(n))$, where $\Delta$ is the smallest sub-optimality gap. Our empirical study shows that LinMED performs competitively with state-of-the-art algorithms.
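To illustrate why closed-form sampling probabilities matter for off-policy evaluation, the following is a minimal sketch in the simpler multi-armed setting, not the LinMED algorithm itself. It assumes a MED-style rule in which each arm's probability is proportional to $\exp(-N_a \hat{\Delta}_a^2 / (2\sigma^2))$, with $N_a$ the pull count and $\hat{\Delta}_a$ the empirical gap; the function names `med_style_probs` and `ips_value` and the log format are hypothetical. Because the probabilities are computed exactly, they can be logged and reused as propensities in a standard inverse-propensity-score (IPS) estimate.

```python
import math


def med_style_probs(means, counts, sigma=1.0):
    """Closed-form MED-style sampling probabilities for a K-armed bandit.

    Assumed form (illustrative only, not the LinMED rule):
    p_a is proportional to exp(-counts[a] * gap_a^2 / (2 * sigma^2)),
    where gap_a is the gap between the best empirical mean and means[a].
    """
    best = max(means)
    weights = [
        math.exp(-n * (best - m) ** 2 / (2.0 * sigma ** 2))
        for m, n in zip(means, counts)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def ips_value(logs, target_policy):
    """Inverse-propensity-score estimate of a target policy's mean reward.

    logs: list of (arm, reward, propensity) tuples, where propensity is the
          exact probability with which the logging policy chose that arm.
    target_policy: list of probabilities over arms (hypothetical format).
    """
    return sum(target_policy[a] * r / p for a, r, p in logs) / len(logs)


# Usage: log the exact propensity alongside each pull, then reuse it offline.
probs = med_style_probs(means=[1.0, 0.0], counts=[5, 2])
logs = [(0, 1.0, probs[0]), (1, 0.0, probs[1])]
print(probs, ips_value(logs, target_policy=[1.0, 0.0]))
```

With a sampler such as Thompson sampling, the propensities in `logs` would typically have to be approximated (for example by Monte Carlo), which can bias the IPS estimate; a closed form avoids that approximation.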
