Greedy Algorithm for Structured Bandits: A Sharp Characterization of Asymptotic Success / Failure

Aleksandrs Slivkins, Yunzong Xu, Shiliang Zuo

TL;DR

This work provides a sharp, instance-level theory for the greedy algorithm in structured bandits. It identifies self-identifiability as the exact per-instance condition separating sublinear from linear regret, and shows that decoys, which act like adversarial reward models, cause permanent misidentification and failure. The results extend from plain MAB to structured and contextual bandits, to interactive decision-making with arbitrary feedback (DMSO), and cover finite as well as infinite reward structures, the latter through margin-based partial characterizations. The takeaway is that Greedy succeeds only when the problem is intrinsically easy, in the sense that any algorithm satisfying a mild non-degeneracy condition also succeeds. This has implications for practical deployment and for future theoretical refinements.
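To make the two notions concrete, here is a minimal Python sketch for a finite reward structure. It rests on assumptions that this summary does not spell out: a reward function is a list of per-arm expected rewards with a unique best arm; a function `f_dec` counts as a decoy for `f_star` when its own best arm `a_dec` satisfies `f_dec[a_dec] == f_star[a_dec] < f_star[a_star]`; and an instance is self-identifiable when every suboptimal arm `a` remains suboptimal under every `f` in `F` that agrees with `f_star` on `a`. The helper names `best_arm`, `is_decoy`, and `is_self_identifiable` are hypothetical.

```python
from typing import List, Sequence

RewardFn = Sequence[float]  # RewardFn[a] = expected reward of arm a


def best_arm(f: RewardFn) -> int:
    """Index of the best arm (assumes the maximum is unique)."""
    return max(range(len(f)), key=lambda a: f[a])


def is_decoy(f_dec: RewardFn, f_star: RewardFn, tol: float = 1e-12) -> bool:
    """Assumed decoy condition: f_dec agrees with f_star on f_dec's own
    best arm, and that arm is strictly suboptimal under f_star."""
    a_dec, a_star = best_arm(f_dec), best_arm(f_star)
    agrees = abs(f_dec[a_dec] - f_star[a_dec]) <= tol
    suboptimal = f_star[a_dec] < f_star[a_star] - tol
    return agrees and suboptimal


def is_self_identifiable(f_star: RewardFn, F: List[RewardFn],
                         tol: float = 1e-12) -> bool:
    """Assumed definition: for every arm a that is suboptimal under f_star,
    every f in F with f[a] == f_star[a] must also rank a as suboptimal."""
    a_star = best_arm(f_star)
    for a in range(len(f_star)):
        if a == a_star:
            continue
        for f in F:
            if abs(f[a] - f_star[a]) <= tol and best_arm(f) == a:
                return False  # arm a cannot be ruled out from its own reward
    return True


# Illustrative two-arm instance (values are made up for the example).
f_star = [0.5, 1.0]          # true model: arm 1 is best
f_decoy = [0.5, 0.2]         # matches f_star on arm 0 and prefers arm 0
f_harmless = [0.3, 0.2]      # prefers arm 0 but disagrees with f_star there
```

Under these assumptions, `[f_star, f_decoy]` is not self-identifiable, because pulling arm 0 forever never contradicts `f_decoy`, while `[f_star, f_harmless]` is.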

Abstract

We study the greedy (exploitation-only) algorithm in bandit problems with a known reward structure. We allow arbitrary finite reward structures, while prior work focused on a few specific ones. We fully characterize when the greedy algorithm asymptotically succeeds or fails, in the sense of sublinear vs. linear regret as a function of time. Our characterization identifies a partial identifiability property of the problem instance as the necessary and sufficient condition for the asymptotic success. Notably, once this property holds, the problem becomes easy -- any algorithm will succeed (in the same sense as above), provided it satisfies a mild non-degeneracy condition. Our characterization extends to contextual bandits and interactive decision-making with arbitrary feedback. Examples demonstrating broad applicability and extensions to infinite reward structures are provided.
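As an illustration of what an exploitation-only algorithm looks like with a known finite reward structure, the following Python sketch fits a reward function from the structure to all data seen so far and then plays that function's best arm. Several details are assumptions rather than the paper's specification: the warm-up of one pull per arm, the least-squares fit (a natural choice under Gaussian noise), the Gaussian noise model, and the names `greedy` and `best_arm`.

```python
import random
from typing import List, Sequence, Tuple

RewardFn = Sequence[float]  # RewardFn[a] = expected reward of arm a


def best_arm(f: RewardFn) -> int:
    """Index of the best arm (ties broken by lowest index)."""
    return max(range(len(f)), key=lambda a: f[a])


def greedy(f_star: RewardFn, F: List[RewardFn], horizon: int,
           noise_sd: float = 0.0, seed: int = 0) -> List[int]:
    """Run a greedy sketch for `horizon` rounds; return the arms played.

    Assumed protocol: pull each arm once as warm-up, then in every round
    pick the function in F with the smallest squared error on the history
    and play its best arm. No exploration is ever added.
    """
    rng = random.Random(seed)
    n_arms = len(f_star)
    history: List[Tuple[int, float]] = []  # (arm, observed reward)
    chosen: List[int] = []

    for t in range(horizon):
        if t < n_arms:
            arm = t  # warm-up: one sample per arm (assumption)
        else:
            def sq_err(f: RewardFn) -> float:
                return sum((r - f[a]) ** 2 for a, r in history)
            f_hat = min(F, key=sq_err)   # least-squares fit within F
            arm = best_arm(f_hat)        # pure exploitation
        reward = f_star[arm] + rng.gauss(0.0, noise_sd)
        history.append((arm, reward))
        chosen.append(arm)
    return chosen


# Illustrative noiseless run on a made-up two-arm instance.
arms_played = greedy([0.5, 1.0], [[0.5, 1.0], [0.5, 0.2]], horizon=10)
```

With `noise_sd=0.0` the warm-up reveals both arms exactly, so the fit recovers the true function and the run settles on arm 1. With noisy rewards, an unlucky warm-up can instead lock the fit onto a competing function whose best arm never yields contradicting data, which is the kind of failure the characterization is about.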

Paper Structure

This paper contains 28 sections, 38 theorems, 104 equations.

Key Result

Theorem 3

Fix a problem instance $(f^*,\mathcal{F})$ of $\mathtt{StructuredMAB}$.

Theorems & Definitions (53)

  • Definition 1: Self-identifiability
  • Definition 2: Decoy
  • Claim 1
  • Theorem 3
  • Definition 4: Self-identifiability
  • Definition 5: Decoy
  • Theorem 6
  • Remark 7
  • Definition 8: Self-identifiability
  • Definition 9: Decoy
  • ...and 43 more