Exploration by Optimization with Hybrid Regularizers: Logarithmic Regret with Adversarial Robustness in Partial Monitoring

Taira Tsuchiya, Shinji Ito, Junya Honda

TL;DR

The paper advances online learning in partial monitoring (PM) by integrating Exploration by Optimization (ExO) with a novel hybrid regularizer (log-barrier plus complement negative Shannon entropy) within Follow-the-Regularized-Leader. This yields substantially improved, problem-dependent regret bounds: in locally observable PM, a stochastic bound of $O\left(\sum_{a \neq a^*} k^2 m^2 \log T / \Delta_a\right)$ and an adversarial bound of $O(k^{3/2} m \sqrt{T \log T})$, while globally observable PM admits the first $O(\log T)$ stochastic bound. The work introduces a $k$-independent feasible region $\mathcal{R}(q)$ and a water-transfer operator to bound the ExO objective, enabling sharper, dimension-free control of the stability and penalty terms. The $O(\log T)$ result for globally observable PM demonstrates the framework's broad applicability. These results offer near-optimal, best-of-both-worlds guarantees in PM and deepen understanding of how hybrid regularizers interact with limited-feedback online learning.
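To make the hybrid regularizer concrete, here is a minimal sketch of a single Follow-the-Regularized-Leader step over the probability simplex. It is an illustration under stated assumptions, not the paper's algorithm: the exact form $\psi(q) = \beta \sum_a (1 - q_a)\log(1 - q_a) + \gamma \sum_a \log(1/q_a)$, the fixed weights `beta` and `gamma`, and all function names are hypothetical, and the ExO step that chooses the exploration distribution and loss estimator is omitted entirely.

```python
import math

EPS = 1e-12  # keeps every probability strictly inside (0, 1)


def hybrid_regularizer(q, beta, gamma):
    """Assumed form: beta * sum (1-q_a) log(1-q_a) + gamma * sum log(1/q_a)."""
    complement_entropy = sum((1.0 - x) * math.log(1.0 - x) for x in q)
    log_barrier = sum(-math.log(x) for x in q)
    return beta * complement_entropy + gamma * log_barrier


def _solve_coordinate(loss, lam, beta, gamma, iters=100):
    """Bisection for the root in q of loss + lam + d(psi)/dq, which is increasing in q."""
    lo, hi = EPS, 1.0 - EPS
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        grad = loss + lam - beta * (math.log(1.0 - mid) + 1.0) - gamma / mid
        if grad > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def ftrl_hybrid(cum_losses, beta, gamma, iters=100):
    """Approximate argmin over the simplex of <cum_losses, q> + psi(q)."""
    if len(cum_losses) < 2:
        raise ValueError("need at least two actions")

    def total(lam):
        # Sum of per-coordinate solutions; decreasing in the multiplier lam.
        return sum(_solve_coordinate(l, lam, beta, gamma) for l in cum_losses)

    # Bracket the Lagrange multiplier that makes the coordinates sum to one.
    lam_lo, lam_hi = -1.0, 1.0
    while total(lam_lo) < 1.0:
        lam_lo *= 2.0
    while total(lam_hi) > 1.0:
        lam_hi *= 2.0

    # Outer bisection on the multiplier.
    for _ in range(iters):
        lam = 0.5 * (lam_lo + lam_hi)
        if total(lam) > 1.0:
            lam_lo = lam
        else:
            lam_hi = lam

    lam = 0.5 * (lam_lo + lam_hi)
    q = [_solve_coordinate(l, lam, beta, gamma) for l in cum_losses]
    s = sum(q)
    return [x / s for x in q]  # remove residual rounding error


q = ftrl_hybrid([0.0, 1.0, 2.0], beta=1.0, gamma=1.0)
```

Because the assumed regularizer is separable and strictly convex for positive `beta` and `gamma`, the first-order conditions reduce to one scalar equation per action plus a shared multiplier, which is why nested bisection suffices here. The log-barrier term keeps every probability bounded away from zero, so actions with large cumulative loss still receive some mass.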

Abstract

Partial monitoring is a generic framework of online decision-making problems with limited feedback. To make decisions from such limited feedback, it is necessary to find an appropriate distribution for exploration. Recently, a powerful approach for this purpose, exploration by optimization (ExO), was proposed, which achieves optimal bounds in adversarial environments with follow-the-regularized-leader for a wide range of online decision-making problems. However, a naive application of ExO in stochastic environments significantly degrades regret bounds. To resolve this issue in locally observable games, we first establish a new framework and analysis for ExO with a hybrid regularizer. This development allows us to significantly improve existing regret bounds of best-of-both-worlds (BOBW) algorithms, which achieve nearly optimal bounds in both stochastic and adversarial environments. In particular, we derive a stochastic regret bound of $O(\sum_{a \neq a^*} k^2 m^2 \log T / \Delta_a)$, where $k$, $m$, and $T$ are the numbers of actions, observations, and rounds, $a^*$ is an optimal action, and $\Delta_a$ is the suboptimality gap for action $a$. This bound is roughly $\Theta(k^2 \log T)$ times smaller than existing BOBW bounds. In addition, for globally observable games, we provide a new BOBW algorithm with the first $O(\log T)$ stochastic bound.
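To get a feel for the two regimes, the leading-order terms of the stated bounds can be evaluated numerically. The sketch below drops all constants hidden by the $O(\cdot)$ notation, so only the relative scaling is meaningful; the function names and the sample values of $k$, $m$, $T$, and the gaps are hypothetical and chosen purely for illustration.

```python
import math


def stochastic_bound(gaps, k, m, T):
    """Leading term sum_a k^2 m^2 log(T) / Delta_a over suboptimal actions."""
    return sum(k**2 * m**2 * math.log(T) / gap for gap in gaps)


def adversarial_bound(k, m, T):
    """Leading term k^(3/2) m sqrt(T log T)."""
    return k**1.5 * m * math.sqrt(T * math.log(T))


# Hypothetical instance: 3 actions, 2 observations, gaps of the 2 suboptimal actions.
k, m, gaps = 3, 2, [0.1, 0.2]
for T in (10**3, 10**6, 10**9):
    print(T, stochastic_bound(gaps, k, m, T), adversarial_bound(k, m, T))
```

The stochastic term grows only logarithmically in $T$ but scales inversely with the gaps, so for very small gaps or short horizons the adversarial term can be the smaller of the two.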

Paper Structure

This paper contains 40 sections, 14 theorems, 94 equations, 2 tables, 2 algorithms.

Key Result

Lemma 2

For any globally observable game, there exists a function $G \in \mathcal{H}$ such that, for any Pareto optimal actions $a, b \in \Pi$, [equation not recovered from the source]. In particular, the following $G^{\circ}$ satisfies this condition: [equation not recovered from the source], where $\mathrm{path}_\mathscr{T}(b)$ is the set of edges from $b \in \Pi$ to the root on $\mathscr{T}$.

Theorems & Definitions (32)

  • Definition 1
  • Lemma 2 (Lattimore and Szepesvári, 2020)
  • Lemma 3
  • Remark 1
  • Remark 2
  • Lemma 4
  • Remark 3
  • Theorem 5
  • Remark 4
  • Lemma 6
  • ...and 22 more