Non-stochastic Bandits With Evolving Observations

Yogev Bar-On; Yishay Mansour


TL;DR

This paper addresses online learning when action feedback evolves adversarially over time, unifying delayed, corrupted, and composite feedback into a single framework. It proposes two algorithms: Evolving Exponential Weights for the full-information setting and Evolving FTRL for the bandit setting. Their regret bounds scale with the total feedback inaccuracy $\Lambda$ and recover known results in special cases. The results show asymptotically optimal regret (up to logarithmic factors) that adapts to how accurately the observed feedback tracks the true losses, and they introduce a skipping technique to handle unbounded delays. The framework is demonstrated through applications to optimistic delayed feedback, corrupted feedback, and composite delayed feedback, highlighting practical relevance for finance and online advertising.

Abstract

We introduce a novel online learning framework that unifies and generalizes pre-established models, such as delayed and corrupted feedback, to encompass adversarial environments where action feedback evolves over time. In this setting, the observed loss is arbitrary and may not correlate with the true loss incurred, with each round updating previous observations adversarially. We propose regret minimization algorithms for both the full-information and bandit settings, with regret bounds quantified by the average feedback accuracy relative to the true loss. Our algorithms match the known regret bounds across many special cases, while also introducing previously unknown bounds.
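To make the evolving-observation protocol concrete, here is a minimal sketch under stated assumptions. It runs standard exponential weights (Hedge) on whatever the learner currently observes for past rounds, re-summing the observations every round because any of them may have been revised. It is not the paper's Evolving Exponential Weights: the paper's exact update, step size, and definition of $\Lambda$ are not reproduced here. The observation model `delayed_view` is a hypothetical instance of delayed feedback (a round reads as zero loss until it is revealed, then is revised to the true loss), and all function names are illustrative.

```python
import math
from typing import Callable, List

Loss = List[float]


def exp_weights_probs(observed_cum: Loss, eta: float) -> Loss:
    """Exponential weights over the currently observed cumulative losses."""
    m = min(observed_cum)  # shift by the minimum for numerical stability
    w = [math.exp(-eta * (c - m)) for c in observed_cum]
    z = sum(w)
    return [x / z for x in w]


def evolving_hedge(true_losses: List[Loss],
                   observe: Callable[[int], List[Loss]],
                   eta: float) -> float:
    """Play Hedge on the latest observations and return the expected regret.

    observe(t) returns the learner's current view of the loss vectors of
    rounds 0..t-1. That view may differ arbitrarily from the true losses and
    may change from one round to the next (the "evolving" part).
    """
    K = len(true_losses[0])
    learner_loss = 0.0
    for t, loss_t in enumerate(true_losses):
        # Re-sum from scratch: any past observation may have been revised.
        observed_cum = [0.0] * K
        for obs in observe(t):
            for a in range(K):
                observed_cum[a] += obs[a]
        p = exp_weights_probs(observed_cum, eta)
        # Expected loss of the randomized choice, charged on the TRUE loss.
        learner_loss += sum(p[a] * loss_t[a] for a in range(K))
    best_fixed = min(sum(l[a] for l in true_losses) for a in range(K))
    return learner_loss - best_fixed


def delayed_view(true_losses: List[Loss], delay: int) -> Callable[[int], List[Loss]]:
    """Hypothetical delayed-feedback observation model: round s reads as all
    zeros until it is revealed at the end of round s + delay, after which its
    observation is revised to the true loss vector."""
    K = len(true_losses[0])

    def observe(t: int) -> List[Loss]:
        return [true_losses[s] if s + delay < t else [0.0] * K
                for s in range(t)]

    return observe


if __name__ == "__main__":
    losses = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    for d in (0, 1, 10):
        print(d, evolving_hedge(losses, delayed_view(losses, d), eta=0.5))
```

With `delay=0` the sketch reduces to ordinary full-information Hedge; with a delay longer than the horizon the learner never sees any feedback and plays uniformly. A corrupted-feedback model would be another `observe` function whose entries differ from `true_losses`, and in general the gap between the two is what a measure like $\Lambda$ is meant to capture.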

Paper Structure

This paper contains 30 sections, 20 theorems, 85 equations, and 3 algorithms.

Introduction
Evolving feedback
Contributions and outline
Full-information setting
Bandit setting
Applications
Additional related works
Evolving Exponential Weights
Analysis
Evolving FTRL
Loss estimates
Feedback accuracy measure
Analysis
Drift terms
Bounds

Key Result

Lemma 1

Computing $p$ as in Eq. (eq:prob-def), we have for any action $a\in[K]$:

Theorems & Definitions (27)

Lemma 1
Lemma 2
Theorem 1
Corollary 1
Lemma 3
Lemma 4
Theorem 2
Corollary 2
Lemma 5
Corollary 3
