Self-Concordant Perturbations for Linear Bandits
Lucas Lévy, Jean-Lou Valeau, Arya Akhavan, Patrick Rebeschini
TL;DR
This work tackles adversarial linear bandits by unifying Follow-the-Regularized-Leader (FTRL) and Follow-the-Perturbed-Leader (FTPL) within a Bandits-GBPA framework (a bandit extension of the Gradient-Based Prediction Algorithm). The framework is augmented with self-concordant perturbations, which reproduce in FTPL the smoothing that self-concordant barriers provide in FTRL. The authors introduce SC-FTPL, a perturbation-based algorithm that concentrates exploration at extreme points and leverages self-concordant barriers to control regret. For the hypercube and Euclidean ball action sets, they construct explicit perturbations and prove regret bounds of $O(d\sqrt{n\ln n})$. On the hypercube this is a $\sqrt{d}$ improvement over SCRiBLe and matches the optimal rate up to logarithmic factors; on the ball it matches SCRiBLe's rate. In both cases the per-round complexity is linear in $d$. They also highlight the heavy-tailed nature of self-concordant perturbations and provide a detailed complexity analysis. Overall, the work advances tractable, principled strategies for adversarial linear bandits across canonical geometries and opens avenues for extending self-concordant perturbations to broader convex sets and learning settings.
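The FTPL side of this bridge can be illustrated in the full-information case, where the expected FTPL action equals the gradient of a perturbation-smoothed potential; this identity is what lets FTPL be read as a GBPA/FTRL-style update. The following is a minimal sketch under stated assumptions: the action set is the hypercube $[-1,1]^d$, the perturbation is Gaussian and used purely as a stand-in (it is not the heavy-tailed self-concordant perturbation constructed in the paper), and no bandit loss estimation is included. All function names are hypothetical.

```python
import math

import numpy as np

# Hypothetical helper names. The Gaussian perturbation is a stand-in,
# NOT the self-concordant perturbation from the paper.


def ftpl_action_hypercube(theta: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Linear-optimization oracle on the hypercube [-1, 1]^d.

    Returns argmax over x in [-1, 1]^d of <x, theta + z>, i.e. the vertex
    sign(theta + z); zero coordinates (probability zero under a continuous
    perturbation) are mapped to +1.
    """
    return np.where(theta + z >= 0.0, 1.0, -1.0)


def expected_action_mc(theta: np.ndarray, sigma: float, n_samples: int,
                       rng: np.random.Generator) -> np.ndarray:
    """Monte Carlo estimate of E_Z[ftpl_action_hypercube(theta, Z)], Z ~ N(0, sigma^2 I)."""
    z = rng.normal(0.0, sigma, size=(n_samples, theta.shape[0]))
    # Broadcasting applies the oracle to every perturbation sample at once.
    return ftpl_action_hypercube(theta, z).mean(axis=0)


def smoothed_potential_grad(theta: np.ndarray, sigma: float) -> np.ndarray:
    """Closed-form gradient of Phi(theta) = E_Z[max_x <x, theta + Z>] = E_Z[||theta + Z||_1].

    Per coordinate, d/dtheta_i E|theta_i + Z_i| = E[sign(theta_i + Z_i)]
    = 2 * P(Z_i >= -theta_i) - 1 = erf(theta_i / (sigma * sqrt(2))).
    """
    return np.array([math.erf(t / (sigma * math.sqrt(2.0))) for t in theta])


# Usage: theta plays the role of a (negated, scaled) cumulative loss vector.
rng = np.random.default_rng(0)
theta = np.array([-0.8, 0.0, 0.3, 1.5])
sigma = 1.0
mc = expected_action_mc(theta, sigma, 200_000, rng)
exact = smoothed_potential_grad(theta, sigma)
max_err = float(np.max(np.abs(mc - exact)))
print("Monte Carlo estimate:", mc)
print("gradient of Phi:     ", exact)
print("max abs error:", max_err)
```

The Gaussian choice keeps the smoothed gradient in closed form (one erf per coordinate), so the Monte Carlo average of oracle outputs can be checked directly against it. Swapping in a different perturbation distribution changes how Z is sampled but not the oracle call itself.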
Abstract
We study the adversarial linear bandits problem and present a unified algorithmic framework that bridges Follow-the-Regularized-Leader (FTRL) and Follow-the-Perturbed-Leader (FTPL) methods, extending to the bandit setting the connection between them that is known in the full-information setting. Within this framework, we introduce self-concordant perturbations, a family of probability distributions that mirror the role of self-concordant barriers previously employed in the FTRL-based SCRiBLe algorithm. Using this idea, we design a novel FTPL-based algorithm that combines self-concordant regularization with efficient stochastic exploration. Our approach achieves a regret of $O(d\sqrt{n \ln n})$ on both the $d$-dimensional hypercube and the Euclidean ball. On the Euclidean ball, this matches the rate attained by existing self-concordant FTRL methods. For the hypercube, this represents a $\sqrt{d}$ improvement over these methods and matches the optimal bound up to logarithmic factors.
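For context, a back-of-the-envelope comparison shows where the $\sqrt{d}$ gap comes from. It assumes SCRiBLe's standard guarantee of $O(d\sqrt{\vartheta\, n \ln n})$ (up to constants) for a $\vartheta$-self-concordant barrier, instantiated with the standard logarithmic barriers on each set:

$$\text{hypercube } [-1,1]^d:\quad \mathcal{R}(x) = -\sum_{i=1}^{d} \ln(1 - x_i^2),\quad \vartheta = 2d \;\Longrightarrow\; O\big(d\sqrt{2d\, n \ln n}\big) = O\big(d^{3/2}\sqrt{n \ln n}\big),$$

$$\text{ball } \{x : \|x\|_2 \le 1\}:\quad \mathcal{R}(x) = -\ln\big(1 - \|x\|_2^2\big),\quad \vartheta = 1 \;\Longrightarrow\; O\big(d\sqrt{n \ln n}\big).$$

Set against the $O(d\sqrt{n \ln n})$ bound above, the ratio is $\sqrt{d}$ on the hypercube and constant on the ball. On the hypercube, the remaining gap to the optimal $d\sqrt{n}$ rate is the $\sqrt{\ln n}$ factor.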
