COMPASS-Hedge: Learning Safely Without Knowing the World

Ting Hu; Luanda Cai; Manolis Vlatakis

Abstract

Online learning algorithms often face a fundamental trilemma: balancing regret guarantees across adversarial and stochastic settings while also providing baseline safety against a fixed comparator. While existing methods excel in one or two of these regimes, they typically fail to unify all three without sacrificing optimal rates or requiring oracle access to problem-dependent parameters. In this work, we bridge this gap by introducing COMPASS-Hedge. Our algorithm is the first full-information method to simultaneously achieve: i) minimax-optimal regret in adversarial environments; ii) instance-optimal, gap-dependent regret in stochastic environments; and iii) $\tilde{\mathcal{O}}(1)$ regret relative to a designated baseline policy, up to logarithmic factors. Crucially, COMPASS-Hedge is parameter-free and requires no prior knowledge of the environment's nature or the magnitude of the stochastic suboptimality gaps. Our approach hinges on a novel integration of adaptive pseudo-regret scaling and phase-based aggression, coupled with a comparator-aware mixing strategy. To the best of our knowledge, this provides the first "best-of-three-worlds" guarantee in the full-information setting, establishing that baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency.
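For readers less familiar with the setting, the following is a minimal sketch of classical Hedge (exponential weights) in the full-information setting, together with the two regret quantities the abstract refers to: regret against the best expert in hindsight and regret against a designated baseline expert. It is not COMPASS-Hedge; the function names `hedge` and `regrets`, the fixed-horizon learning rate, and the synthetic uniform losses are assumptions made purely for illustration.

```python
import math
import random


def hedge(losses, prior=None, eta=None):
    """Classical Hedge on a T x K loss matrix with entries in [0, 1].

    Returns the learner's cumulative expected loss. This is the textbook
    baseline, not the COMPASS-Hedge algorithm.
    """
    T, K = len(losses), len(losses[0])
    if prior is None:
        prior = [1.0 / K] * K  # uniform prior over experts
    if eta is None:
        # Textbook tuning; note that it needs the horizon T in advance,
        # which a parameter-free method would avoid.
        eta = math.sqrt(8.0 * math.log(K) / T)
    log_w = [math.log(p) for p in prior]  # log-weights for numerical stability
    total = 0.0
    for row in losses:
        m = max(log_w)
        w = [math.exp(v - m) for v in log_w]
        z = sum(w)
        p = [x / z for x in w]  # distribution played this round
        total += sum(pi * li for pi, li in zip(p, row))
        # Full information: every expert's loss is observed and used.
        log_w = [v - eta * l for v, l in zip(log_w, row)]
    return total


def regrets(losses, baseline, **kwargs):
    """Return (regret vs. best expert in hindsight, regret vs. baseline expert)."""
    learner = hedge(losses, **kwargs)
    K = len(losses[0])
    cum = [sum(row[k] for row in losses) for k in range(K)]
    return learner - min(cum), learner - cum[baseline]


# Hypothetical usage on synthetic i.i.d. losses.
random.seed(0)
T, K = 1000, 5
losses = [[random.random() for _ in range(K)] for _ in range(T)]
r_best, r_base = regrets(losses, baseline=2)
print(f"regret vs best expert: {r_best:.3f}")
print(f"regret vs baseline:    {r_base:.3f}")
```

With this tuning and a uniform prior, classical Hedge guarantees regret of at most $\sqrt{T \ln K / 2}$ against the best expert, and the same bound therefore covers any fixed baseline. The abstract's claim is stronger on the baseline axis: $\tilde{\mathcal{O}}(1)$ regret relative to the designated baseline, without knowing $T$ or the environment type.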

Paper Structure (73 sections, 12 theorems, 56 equations, 3 figures, 1 algorithm)

Introduction
Technical Bottleneck: The Trade-off between Conservation and Aggression.
Our Contributions.
1. Autonomous Aggression via Geometric Scaling.
2. Bridging the Expected–Pseudo Regret Gap.
3. Horizon-Free Self-Scaling and Comparator Mixing.
Related Work and Motivating Scenarios
Bicriteria Measures and Baseline Safety.
The Practical Imperative of Full Information.
Preliminaries and Problem Setup
Notation.
Interaction Protocol.
Filtration and Adversarial Models.
Performance Metrics
From Expected to Pseudo-Regret.
...and 58 more sections

Key Result

Lemma 3.1

Let $I \subseteq \{1,\dots,T\}$ be any interval. Then, where $\mathcal{R}_{\mathrm{exp}}(I)$ and $\mathcal{R}_{\mathrm{pseudo}}(I)$ denote, respectively, the expected regret and the pseudo-regret accumulated over $I$. Moreover, the remainder term $\mathbb{G}(I)$ satisfies

Figures (3)

Figure 1: The Best-of-Three-Worlds landscape. Each circle represents one desideratum. Prior algorithms (annotations) cover at most two regions simultaneously. COMPASS-Hedge (center) is the first to unify all three in a single, parameter-free algorithm.
Figure 2: Median Regret (500 trials). Shaded regions denote $10^{\text{th}}$--$90^{\text{th}}$ percentiles. (a, c) With Oracle prior, COMPASS-Hedge (blue) matches the best expert (near-zero regret). (b, d) With Uniform prior, Standard Hedge (orange) exhibits high variance (risk). COMPASS-Hedge maintains a flat trajectory near zero comparator regret (d), empirically confirming the $O(1)$ safety bound.
Figure 3: Median Regret (500 trials). Shaded regions denote $10^{\text{th}}$--$90^{\text{th}}$ percentiles. (a, c) With Oracle prior, COMPASS-Hedge (blue) matches the best expert. (b, d) With Uniform prior, Standard Hedge (orange) exhibits high variance. COMPASS-Hedge maintains a flat trajectory (d), confirming the $O(1)$ safety bound.

Theorems & Definitions (27)

Definition 3.1: Expected Regret
Definition 3.2: Comparator Regret
Definition 3.3: Pseudo-Regret
Remark 3.1: Discussion on the Gap.
Remark 3.2
Remark 3.3
Remark 3.4: On Identifiability
Lemma 3.1: Pseudo-regret versus expected regret
Remark 3.5
Theorem 4.1: Universal Regret Guarantees
...and 17 more
