COMPASS-Hedge: Learning Safely Without Knowing the World

Ting Hu, Luanda Cai, Manolis Vlatakis

Abstract

Online learning algorithms often face a fundamental trilemma: achieving strong regret guarantees in both adversarial and stochastic settings while also providing baseline safety against a fixed comparator. While existing methods excel in one or two of these regimes, they typically fail to unify all three without sacrificing optimal rates or requiring oracle access to problem-dependent parameters. In this work, we bridge this gap by introducing COMPASS-Hedge. Our algorithm is the first full-information method to simultaneously achieve: i) minimax-optimal regret in adversarial environments; ii) instance-optimal, gap-dependent regret in stochastic environments; and iii) $\tilde{\mathcal{O}}(1)$ regret relative to a designated baseline policy, up to logarithmic factors. Crucially, COMPASS-Hedge is parameter-free and requires no prior knowledge of the environment's nature or the magnitude of the stochastic suboptimality gaps. Our approach hinges on a novel integration of adaptive pseudo-regret scaling and phase-based aggression, coupled with a comparator-aware mixing strategy. To the best of our knowledge, this provides the first "best-of-three-worlds" guarantee in the full-information setting, establishing that baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency.
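For orientation, the full-information setting and the two comparators named above (best expert and designated baseline) can be illustrated with classic Hedge (exponential weights). The sketch below is not COMPASS-Hedge: it uses a fixed learning rate tuned with knowledge of the horizon, and the names `hedge`, `regrets`, `prior`, and `baseline`, as well as the synthetic Bernoulli losses, are assumptions made purely for illustration.

```python
import numpy as np

def hedge(losses, prior, eta):
    """Classic Hedge with a fixed learning rate (illustrative, not COMPASS-Hedge).

    losses: array of shape (T, K) with entries in [0, 1]
    prior:  strictly positive initial distribution over the K experts
    eta:    fixed learning rate (an assumption; the paper's method is parameter-free)
    Returns the learner's expected loss in each round, shape (T,).
    """
    losses = np.asarray(losses, dtype=float)
    log_w = np.log(np.asarray(prior, dtype=float))
    learner_loss = np.empty(losses.shape[0])
    for t, loss_t in enumerate(losses):
        # Normalise in log-space for numerical stability.
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        learner_loss[t] = p @ loss_t   # expected loss of playing p
        log_w -= eta * loss_t          # exponential-weights update
    return learner_loss

def regrets(losses, learner_loss, baseline):
    """Regret against the best expert in hindsight and against a baseline expert."""
    cum_learner = learner_loss.sum()
    cum_experts = np.asarray(losses, dtype=float).sum(axis=0)
    return cum_learner - cum_experts.min(), cum_learner - cum_experts[baseline]

# Synthetic stochastic instance (hypothetical data): expert 0 is best on average.
rng = np.random.default_rng(0)
T, K = 2000, 5
means = np.array([0.3, 0.5, 0.5, 0.5, 0.5])
losses = (rng.random((T, K)) < means).astype(float)

eta = np.sqrt(8 * np.log(K) / T)   # standard tuning; requires knowing T
prior = np.full(K, 1.0 / K)        # uniform prior
learner_loss = hedge(losses, prior, eta)
best_regret, baseline_regret = regrets(losses, learner_loss, baseline=1)
print(f"regret vs best expert: {best_regret:.2f}")
print(f"regret vs baseline expert 1: {baseline_regret:.2f}")
```

With this tuning, Hedge's regret against the best expert is at most $\sqrt{T \ln K / 2}$ for losses in $[0,1]$, but the rate depends on knowing $T$ and does not by itself give constant regret against the baseline. Those are the limitations the abstract says COMPASS-Hedge removes.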

Paper Structure (73 sections, 12 theorems, 56 equations, 3 figures, 1 algorithm)

Key Result

Lemma 3.1

Let $I \subseteq \{1,\dots,T\}$ be any interval, and let $\mathcal{R}_{\mathrm{exp}}(I)$ and $\mathcal{R}_{\mathrm{pseudo}}(I)$ denote, respectively, the expected regret and the pseudo-regret accumulated over $I$. The lemma relates these two quantities through a remainder term $\mathbb{G}(I)$ and then bounds that term; the displayed relations are not reproduced on this page.
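As background for this lemma, the two quantities are commonly defined as follows in the experts setting. This is a sketch under assumed textbook notation ($p_t$ for the learner's distribution over $K$ experts, $\ell_t \in [0,1]^K$ for the loss vector); the paper's own Definitions 3.1 and 3.3 may differ in detail, and the form of $\mathbb{G}(I)$ is not shown here.

```latex
% Assumed standard notation, not taken from the paper:
% p_t is the learner's distribution, \ell_t the loss vector at round t.
\begin{align*}
\mathcal{R}_{\mathrm{exp}}(I)
  &= \mathbb{E}\Big[\max_{i \in [K]} \sum_{t \in I}
     \big(\langle p_t, \ell_t\rangle - \ell_{t,i}\big)\Big], \\
\mathcal{R}_{\mathrm{pseudo}}(I)
  &= \max_{i \in [K]} \mathbb{E}\Big[\sum_{t \in I}
     \big(\langle p_t, \ell_t\rangle - \ell_{t,i}\big)\Big], \\
\mathcal{R}_{\mathrm{pseudo}}(I)
  &\le \mathcal{R}_{\mathrm{exp}}(I)
  \quad \text{since } \max_i \mathbb{E}[X_i] \le \mathbb{E}\big[\max_i X_i\big].
\end{align*}
```

Under these definitions the pseudo-regret never exceeds the expected regret, so a lemma of this type is typically needed only for the reverse direction, where a remainder term controls how far the expected regret can exceed the pseudo-regret.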

Figures (3)

  • Figure 1: The Best-of-Three-Worlds landscape. Each circle represents one desideratum. Prior algorithms (annotations) cover at most two regions simultaneously. COMPASS-Hedge (center) is the first to unify all three in a single, parameter-free algorithm.
  • Figure 2: Median regret (500 trials). Shaded regions denote the 10th-90th percentile range. (a, c) With Oracle prior, COMPASS-Hedge (blue) matches the best expert (near-zero regret). (b, d) With Uniform prior, Standard Hedge (orange) exhibits high variance (risk). COMPASS-Hedge maintains a flat trajectory near zero comparator regret (d), empirically confirming the $O(1)$ safety bound.
  • Figure 3: Median regret (500 trials). Shaded regions denote the 10th-90th percentile range. (a, c) With Oracle prior, COMPASS-Hedge (blue) matches the best expert. (b, d) With Uniform prior, Standard Hedge (orange) exhibits high variance. COMPASS-Hedge maintains a flat trajectory (d), confirming the $O(1)$ safety bound.
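The summary statistics described in the captions of Figures 2 and 3 (a median curve with a 10th-90th percentile band over repeated trials) can be computed as in the sketch below. The function name `regret_bands` and the random-walk curves are hypothetical placeholders, not the paper's code or data.

```python
import numpy as np

def regret_bands(regret_curves, lo=10, hi=90):
    """Per-round median and percentile band across trials.

    regret_curves: array of shape (n_trials, T), one cumulative-regret curve per trial.
    Returns (median, lower, upper), each of shape (T,).
    """
    curves = np.asarray(regret_curves, dtype=float)
    median = np.median(curves, axis=0)
    lower, upper = np.percentile(curves, [lo, hi], axis=0)
    return median, lower, upper

# Placeholder input: 500 synthetic random-walk curves standing in for real regret curves.
rng = np.random.default_rng(1)
n_trials, T = 500, 200
curves = np.cumsum(rng.normal(0.0, 1.0, size=(n_trials, T)), axis=1)

median, lower, upper = regret_bands(curves)
print(median[-1], lower[-1], upper[-1])
```

Reporting the median with a percentile band, rather than the mean with a standard deviation, keeps the plot informative when a few trials have very large regret, which is the kind of high-variance behaviour the captions attribute to Standard Hedge.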

Theorems & Definitions (27)

  • Definition 3.1: Expected Regret
  • Definition 3.2: Comparator Regret
  • Definition 3.3: Pseudo-Regret
  • Remark 3.1: Discussion on the Gap
  • Remark 3.2
  • Remark 3.3
  • Remark 3.4: On Identifiability
  • Lemma 3.1: Pseudo-regret versus expected regret
  • Remark 3.5
  • Theorem 4.1: Universal Regret Guarantees
  • ...and 17 more