Optimal Thresholding Linear Bandit
Eduardo Ochoa Rivera, Ambuj Tewari
TL;DR
This work studies the $\epsilon$-Thresholding Bandit Problem (TBP) in stochastic linear bandits, with the goal of identifying all arms whose mean rewards exceed a threshold $\rho$ under fixed confidence. It derives an instance-specific finite-sample lower bound on the expected number of samples and develops a Track-and-Stop–style algorithm that is asymptotically optimal for TBP in the linear setting, using a least-squares estimator with enforced exploration, a sampling rule that targets the optimal design over arms, and a stopping rule based on a generalized likelihood ratio. The analysis shows that the set of optimal sampling proportions is convex, enabling convergence of the algorithm to the optimal region, and provides both high-probability and in-expectation guarantees on the stopping time that match the lower bound up to constants. Empirically, the method demonstrates competitive performance against baseline pure-exploration algorithms on TBP-specific benchmarks and supports potential extensions to relaxed threshold settings and generalized linear models for broader scientific discovery tasks.
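As a rough illustration of the estimation step mentioned above, the following minimal sketch fits a regularized least-squares estimate of the unknown parameter and then reports the arms whose estimated mean reward exceeds $\rho$. It is not the authors' algorithm: the round-robin pulls stand in for the design-targeting sampling rule, the generalized likelihood ratio stopping rule is omitted, and all names (`least_squares`, `arms_above`, `theta_true`, the regularizer `reg`, the toy arms and noise level) are assumptions made for this example.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def least_squares(arms, pulls, rewards, reg=1e-6):
    """Regularized least-squares estimate from pulled arm indices and rewards.

    The small ridge term `reg` is an assumption that keeps the design
    matrix invertible; it is not taken from the paper.
    """
    d = len(arms[0])
    V = [[reg if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for k, r in zip(pulls, rewards):
        a = arms[k]
        for i in range(d):
            b[i] += a[i] * r
            for j in range(d):
                V[i][j] += a[i] * a[j]
    return solve(V, b)


def arms_above(arms, theta, rho):
    """Indices of arms whose (estimated) mean reward exceeds the threshold."""
    return {k for k, a in enumerate(arms) if dot(a, theta) > rho}


# Toy instance (hypothetical): three arms in R^2, Gaussian noise.
random.seed(0)
arms = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]
theta_true = [1.0, 0.2]  # true means: 1.0, 0.2, 0.72
rho = 0.5

# Round-robin pulls: a stand-in for the paper's sampling rule.
pulls = [t % len(arms) for t in range(3000)]
rewards = [dot(arms[k], theta_true) + random.gauss(0.0, 0.5) for k in pulls]

theta_hat = least_squares(arms, pulls, rewards)
selected = arms_above(arms, theta_hat, rho)
print(selected)
```

A full fixed-confidence procedure would additionally decide when to stop sampling; here the budget of 3000 pulls is fixed in advance purely for illustration.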
Abstract
We study a novel pure exploration problem: the $\epsilon$-Thresholding Bandit Problem (TBP) with fixed confidence in stochastic linear bandits. We prove a lower bound on the sample complexity and extend an algorithm designed for Best Arm Identification in the linear case to TBP; the extended algorithm is asymptotically optimal.
