Beyond Softmax: A New Perspective on Gradient Bandits

Emerson Melo, David Müller

TL;DR

The paper addresses online decision-making with multiple arms by integrating discrete choice theory and gradient-based bandit methods. It broadens the gradient bandit paradigm beyond softmax, i.e., the multinomial logit (MNL) model, to Generalized Nested Logit (GNL) models, enabling correlated learning across actions and nested structures while preserving computationally tractable, closed-form sampling. The main contributions are: (i) sublinear regret bounds for a broad family of gradient-based prediction algorithms (GBPA) that includes Exp3; (ii) a unified adversarial multi-armed bandit (MAB) framework based on generalized extreme value (GEV) and GNL models with differential consistency; and (iii) a stochastic MAB generalization, the Generalized Gradient Bandit Algorithm, that extends Gradient Bandits to nested structures. The approach yields practical algorithms with improved exploration-exploitation trade-offs when arm correlations exist, and numerical experiments demonstrate improvements over standard softmax-based methods in structured environments.

Abstract

We establish a link between a class of discrete choice models and the theory of online learning and multi-armed bandits. Our contributions are: (i) sublinear regret bounds for a broad algorithmic family, encompassing Exp3 as a special case; (ii) a new class of adversarial bandit algorithms derived from generalized nested logit models (Wen and Koppelman, 2001); and (iii) a novel class of generalized gradient bandit algorithms that extends beyond the widely used softmax formulation. By relaxing the restrictive independence assumptions inherent in softmax, our framework accommodates correlated learning dynamics across actions, thereby broadening the applicability of gradient bandit methods. Overall, the proposed algorithms combine flexible model specification with computational efficiency via closed-form sampling probabilities. Numerical experiments in stochastic bandit settings demonstrate their practical effectiveness.
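To make "closed-form sampling probabilities" concrete, the following is a minimal sketch of choice probabilities under a plain nested logit (NL) model, which is a special case of the GNL family in which each arm belongs to exactly one nest. It is not the paper's algorithm: the function names `softmax` and `nested_logit_probs`, the argument layout, and the per-nest parameters `lam` (assumed to lie in (0, 1]) are illustrative assumptions. When every nest parameter equals 1, the probabilities reduce to ordinary softmax.

```python
import math


def softmax(u):
    """Numerically stable softmax over a list of scores."""
    m = max(u)
    w = [math.exp(x - m) for x in u]
    s = sum(w)
    return [x / s for x in w]


def nested_logit_probs(u, nests, lam):
    """Closed-form nested logit choice probabilities (illustrative sketch).

    u     : list of scores, one per arm
    nests : list of lists of arm indices forming a partition (assumption:
            plain NL, so no arm is shared across nests)
    lam   : one parameter per nest, assumed in (0, 1]; smaller values mean
            stronger correlation among arms within the nest
    """
    # Log inclusive value of each nest: I_k = log sum_{j in B_k} exp(u_j / lam_k)
    log_incl = []
    for arms, l in zip(nests, lam):
        scaled = [u[j] / l for j in arms]
        m = max(scaled)
        log_incl.append(m + math.log(sum(math.exp(s - m) for s in scaled)))

    # Probability of picking nest k is a softmax over lam_k * I_k
    nest_p = softmax([l * i_k for l, i_k in zip(lam, log_incl)])

    # P(arm j) = P(nest k) * P(j | nest k), with P(j | k) = exp(u_j / lam_k - I_k)
    probs = [0.0] * len(u)
    for arms, l, i_k, p_k in zip(nests, lam, log_incl, nest_p):
        for j in arms:
            probs[j] = p_k * math.exp(u[j] / l - i_k)
    return probs


# Three arms with equal scores; arms 0 and 1 share a correlated nest.
p_nl = nested_logit_probs([1.0, 1.0, 1.0], [[0, 1], [2]], [0.5, 1.0])
# Same arms with all nest parameters equal to 1: this matches softmax.
p_mnl = nested_logit_probs([1.0, 1.0, 1.0], [[0, 1], [2]], [1.0, 1.0])
```

In this toy case softmax gives each arm probability 1/3, whereas the nested model treats arms 0 and 1 as close substitutes and shifts probability mass toward the stand-alone arm 2. This is one way to see how a nested structure changes exploration relative to softmax.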

Paper Structure

This paper contains 11 sections, 7 theorems, 74 equations, 4 figures.

Key Result

Theorem 2.1

Assume that the expectation of the maximum of the random errors is bounded above, i.e., $\mathbb{E}\left[ \max_{i \in A} \{\epsilon^{(i)}\} \right] \leq \alpha$. Then Algorithm algo:olo is Hannan-consistent, i.e., where $L = 2 \sum_{i=1}^{n} \sum_{j \neq i} g_{i,j}(\bar{z}_{i,j})$. Optimizing the scaling parameter $\eta$ yields:

Figures (4)

  • Figure 1: MNL environment
  • Figure 2: NL environment
  • Figure 3: NL environment learned average rewards
  • Figure 4: NL Large environment

Theorems & Definitions (20)

  • Definition 1
  • Example 2.1: MNL Model
  • Example 2.2: NL Model
  • Definition 2
  • Definition 3
  • Theorem 2.1
  • Proof
  • Corollary 2.1
  • Lemma 3.1
  • Proof
  • ...and 10 more