Optimism in the Face of Ambiguity Principle for Multi-Armed Bandits

Mengmeng Li, Daniel Kuhn, Bahar Taşkesen

TL;DR

The paper tackles the challenge of achieving best-of-both-worlds (BOBW) regret in multi-armed bandits while preserving computational efficiency. It introduces Distributionally Optimistic Perturbations (DOPA), a variant of the Gradient-Based Prediction Algorithm (GBPA) that uses marginal ambiguity sets to induce optimistic perturbations. DOPA links Follow-The-Perturbed-Leader (FTPL) and Follow-The-Regularized-Leader (FTRL) through an additively separable regularizer $\psi$ and the gradient relationship $\nabla_{\boldsymbol u} \Phi^R(\boldsymbol u; \psi)=\nabla_{\boldsymbol u} \Phi(\boldsymbol u; \mathcal{B})$. The paper proves that FTRL with a separable regularizer is equivalent to FTPL under an appropriate ambiguity set, and derives regret bounds under Fréchet marginals (e.g., shifted Pareto) of $\mathcal{R}(T)=\mathcal{O}(\sqrt{KT})$ in adversarial settings and $\mathcal{O}(\log T)$ in stochastic settings. A bisection-based method computes the arm-sampling probabilities with $\mathcal{O}(K)$ per-iteration complexity, yielding speedups of up to $10^4\times$ over classical FTRL while retaining the BOBW guarantees. The framework also extends to hybrid regularizers and to problems beyond bandits. The result is a practical, theory-backed bridge between regularization and perturbation in online learning.
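
For orientation, the two potentials behind this gradient relationship can be written in the form commonly used for GBPA-style algorithms. The display below is a sketch under assumed conventions: the sign convention, any learning-rate scaling, and the exact feasible set may differ in the paper. The symbols $\Delta_K$ (the probability simplex over the $K$ arms) and $\boldsymbol z$ (the random perturbation vector) are notation introduced here for illustration.

```latex
\Phi^R(\boldsymbol u; \psi)
  = \max_{\boldsymbol p \in \Delta_K} \; \langle \boldsymbol p, \boldsymbol u \rangle - \psi(\boldsymbol p),
\qquad
\Phi(\boldsymbol u; \mathcal{B})
  = \sup_{\mathbb{Q} \in \mathcal{B}} \; \mathbb{E}_{\boldsymbol z \sim \mathbb{Q}}
    \Big[ \max_{k \in [K]} \, (u_k + z_k) \Big].
```

Under these assumed definitions, the left potential is the regularized (FTRL-type) one and the right potential is the optimistic, best-case perturbed (FTPL-type) one, so matching their gradients means that both views produce the same arm-sampling distribution.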

Abstract

Follow-The-Regularized-Leader (FTRL) algorithms often enjoy optimal regret for adversarial as well as stochastic bandit problems and allow for a streamlined analysis. Nonetheless, FTRL algorithms require the solution of an optimization problem in every iteration and are thus computationally challenging. In contrast, Follow-The-Perturbed-Leader (FTPL) algorithms achieve computational efficiency by perturbing the estimates of the rewards of the arms, but their regret analysis is cumbersome. We propose a new FTPL algorithm that generates optimal policies for both adversarial and stochastic multi-armed bandits. Like FTRL, our algorithm admits a unified regret analysis, and similar to FTPL, it offers low computational costs. Unlike existing FTPL algorithms that rely on independent additive disturbances governed by a \textit{known} distribution, we allow for disturbances governed by an \textit{ambiguous} distribution that is only known to belong to a given set and propose a principle of optimism in the face of ambiguity. Consequently, our framework generalizes existing FTPL algorithms. It also encapsulates a broad range of FTRL methods as special cases, including several optimal ones, which appears to be impossible with current FTPL methods. Finally, we use techniques from discrete choice theory to devise an efficient bisection algorithm for computing the optimistic arm sampling probabilities. This algorithm is up to $10^4$ times faster than standard FTRL algorithms that solve an optimization problem in every iteration. Our results not only settle existing conjectures but also provide new insights into the impact of perturbations by mapping FTRL to FTPL.
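
To make the contrast with FTRL concrete, a classical FTPL step with a known perturbation distribution can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the function name `ftpl_choose_arm`, the parameter `eta`, and the choice of exponential perturbations are assumptions made for this example.

```python
import random


def ftpl_choose_arm(reward_estimates, eta=1.0, rng=random):
    """Pick the arm whose perturbed reward estimate is largest.

    reward_estimates: cumulative reward estimates, one per arm.
    eta: hypothetical scaling parameter; larger values shrink the noise.
    rng: source of randomness exposing expovariate() (assumption: i.i.d.
         exponential perturbations, chosen only for illustration).
    """
    perturbed = [r + rng.expovariate(1.0) / eta for r in reward_estimates]
    # No optimization problem is solved here: one noise draw per arm and an argmax.
    return max(range(len(perturbed)), key=perturbed.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    estimates = [1.0, 1.5, 0.5]
    counts = [0, 0, 0]
    for _ in range(10_000):
        counts[ftpl_choose_arm(estimates, eta=2.0, rng=rng)] += 1
    print(counts)  # empirical selection frequencies of the three arms
```

The per-round cost is a single pass over the arms, which is the computational advantage the abstract attributes to FTPL. The selection probabilities, however, are only implicit in the noise distribution, which is one reason the regret analysis of such schemes is harder.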

Paper Structure

This paper contains 13 sections, 13 theorems, 48 equations, 1 figure, 3 algorithms.

Key Result

Lemma 3.1

[natarajan2009persistency] If $\mathcal{B}$ is a marginal ambiguity set of the form (eq:marginal:set) and if the cumulative distribution functions $F_k$, $k \in [K]$, are continuous and strictly increasing in $s$ whenever $F_k(s)\in(0,1)$, then the potential function (eq:discrete-best-case) is convex and differentiable [...]. In addition, the unique maximizer of the convex program (eq:frechet-reg-max) is given by [...]

Figures (1)

  • Figure 1: Bisection method for approximating the arm-sampling distribution $\boldsymbol p= \nabla_{\boldsymbol u} \Phi(\boldsymbol u; \mathcal{B})$
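
A bisection of the kind named in the figure caption can be sketched as follows. The sketch assumes, as in marginal distribution models from discrete choice theory, that the sampling probabilities take the form $p_k = 1 - F_k(\lambda - u_k)$ for a scalar $\lambda$ chosen so that the probabilities sum to one. That form, the function names, and the Pareto marginal with shape `alpha` are assumptions for illustration and are not taken from the paper's Figure 1.

```python
def pareto_cdf(s, alpha=2.0):
    """CDF of a Pareto distribution with scale 1 and shape alpha (assumed marginal)."""
    return 0.0 if s <= 1.0 else 1.0 - s ** (-alpha)


def sampling_probabilities(u, cdf=pareto_cdf, tol=1e-12, max_iter=200):
    """Approximate p_k = 1 - cdf(lam - u_k) with sum_k p_k = 1 by bisection on lam."""

    def total(lam):
        # Total probability mass implied by lam; non-increasing in lam.
        return sum(1.0 - cdf(lam - uk) for uk in u)

    # Bracket the root: total(lo) >= 1 >= total(hi).
    lo, step = max(u), 1.0
    while total(lo) < 1.0:
        lo -= step
        step *= 2.0
    hi, step = max(u) + 1.0, 1.0
    while total(hi) > 1.0:
        hi += step
        step *= 2.0

    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid  # too much mass: lam must increase
        else:
            hi = mid  # too little mass: lam must decrease
        if hi - lo < tol:
            break

    lam = 0.5 * (lo + hi)
    p = [1.0 - cdf(lam - uk) for uk in u]
    mass = sum(p)
    return [pk / mass for pk in p]  # remove the residual bisection error


if __name__ == "__main__":
    print(sampling_probabilities([1.0, 0.0, -1.0]))
```

Each bisection step evaluates one CDF per arm, so its cost is linear in the number of arms, and the number of steps depends only on the requested tolerance.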

Theorems & Definitions (30)

  • Remark 1: Exp3 algorithm
  • Definition 1: Marginal ambiguity set
  • Lemma 3.1
  • Proposition 3.2: FTRL vs. DOPA
  • proof
  • Proposition 3.3: FTPL vs. DOPA
  • proof
  • Theorem 3.4: FTRL vs. FTPL
  • proof
  • Theorem 4.1: Regret analysis of DOPA
  • ...and 20 more