Beyond Softmax: A New Perspective on Gradient Bandits
Emerson Melo, David Müller
TL;DR
The paper addresses online decision-making with multiple arms by integrating discrete choice theory with gradient-based bandit methods. It broadens the gradient bandit paradigm beyond the softmax, i.e. multinomial logit (MNL), policy to Generalized Nested Logit (GNL) models, enabling correlated learning across actions and nested structures while preserving computationally tractable, closed-form sampling probabilities. The main contributions are: (i) sublinear regret bounds for a broad family of Gradient-Based Prediction Algorithms (GBPA) that includes Exp3; (ii) a unified adversarial multi-armed bandit (MAB) framework based on generalized extreme value (GEV) and GNL models with differential consistency; and (iii) a stochastic MAB generalization, the Generalized Gradient Bandit Algorithm, that extends Gradient Bandits to nested structures. The approach yields practical algorithms with improved exploration-exploitation trade-offs when arm correlations exist, and numerical experiments show improvements over standard softmax-based methods in structured environments.
Abstract
We establish a link between a class of discrete choice models and the theory of online learning and multi-armed bandits. Our contributions are: (i) sublinear regret bounds for a broad algorithmic family, encompassing Exp3 as a special case; (ii) a new class of adversarial bandit algorithms derived from generalized nested logit models (Wen and Koppelman, 2001); and (iii) a novel class of generalized gradient bandit algorithms that extends beyond the widely used softmax formulation. By relaxing the restrictive independence assumptions inherent in softmax, our framework accommodates correlated learning dynamics across actions, thereby broadening the applicability of gradient bandit methods. Overall, the proposed algorithms combine flexible model specification with computational efficiency via closed-form sampling probabilities. Numerical experiments in stochastic bandit settings demonstrate their practical effectiveness.
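
To make "closed-form sampling probabilities" concrete, the following is a minimal sketch of the standard nested logit choice rule, which is the special case of GNL in which each arm belongs to exactly one nest. It is an illustration only, not the authors' algorithm: the function names, the partition `nests`, and the nest parameters `lam` are hypothetical, and the full GNL model additionally allows arms to belong to several nests through allocation weights, which are omitted here. Setting every nest parameter to 1 recovers the softmax probabilities.

```python
import math
import random


def softmax(u):
    """Softmax (MNL) probabilities, computed stably."""
    m = max(u)
    w = [math.exp(x - m) for x in u]
    total = sum(w)
    return [x / total for x in w]


def nested_logit_probs(u, nests, lam):
    """Nested logit probabilities (illustrative special case of GNL).

    u     : list of arm scores (e.g. preferences or cumulative reward estimates)
    nests : list of lists of arm indices forming a partition (assumed, hypothetical)
    lam   : one parameter per nest in (0, 1]; lam = 1 means no within-nest correlation
    """
    # Log-sum-exp of the rescaled scores within each nest.
    lse = []
    for arms, l in zip(nests, lam):
        scaled = [u[j] / l for j in arms]
        m = max(scaled)
        lse.append(m + math.log(sum(math.exp(s - m) for s in scaled)))

    # P(nest k) is a softmax over the inclusive values lam_k * lse_k.
    nest_probs = softmax([l * v for l, v in zip(lam, lse)])

    # P(arm i) = P(nest k) * P(arm i | nest k).
    probs = [0.0] * len(u)
    for arms, l, v, p_nest in zip(nests, lam, lse, nest_probs):
        for j in arms:
            probs[j] = p_nest * math.exp(u[j] / l - v)
    return probs


def sample_arm(probs, rng=random):
    """Draw one arm index from the closed-form probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

In this sketch, two equally scored arms placed in the same nest with `lam < 1` are treated as close substitutes, so together they receive less probability than under softmax and the remaining arms receive more. This is one simple way in which a nested structure changes exploration relative to the independent softmax rule.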
