Non-stochastic Bandits With Evolving Observations
Yogev Bar-On, Yishay Mansour
TL;DR
This paper addresses online learning in which the feedback for past actions evolves adversarially over time, unifying delayed, corrupted, and composite feedback in a single framework. It proposes two algorithms, Evolving Exponential Weights for the full-information setting and Evolving FTRL for the bandit setting, whose regret bounds scale with the total feedback inaccuracy $\Lambda$ and recover known results in special cases. The regret is asymptotically optimal up to logarithmic factors and adapts to how accurately the observed feedback tracks the true losses, and a skipping technique is introduced to handle unbounded delays. The framework is demonstrated through applications to optimistic delayed feedback, corrupted feedback, and composite delayed feedback, highlighting practical relevance for finance and online advertising.
Abstract
We introduce a novel online learning framework that unifies and generalizes previously established models, such as delayed and corrupted feedback, to encompass adversarial environments where action feedback evolves over time. In this setting, the observed loss is arbitrary and may not correlate with the true loss incurred, and each round may adversarially update the observations of previous rounds. We propose regret minimization algorithms for both the full-information and bandit settings, with regret bounds quantified by the average feedback accuracy relative to the true loss. Our algorithms match the known regret bounds across many special cases and also yield previously unknown bounds.
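To make the setting concrete, the sketch below treats delayed feedback as one special case of evolving observations in the full-information setting: each past round's observed loss vector starts as all zeros and is later revised to the true loss. It is a minimal illustration and not the paper's Evolving Exponential Weights algorithm. The names `exp_weights` and `simulate`, the all-zero placeholder for feedback that has not yet arrived, and the fixed learning rate `eta` are all assumptions made here for illustration.

```python
import math


def exp_weights(observations, n_actions, eta):
    """Exponential-weights distribution computed from the current observations.

    observations: list of loss vectors, one per past round, as observed right
    now (hypothetical representation; entries may be revised between rounds).
    """
    totals = [0.0] * n_actions
    for row in observations:
        for a, loss in enumerate(row):
            totals[a] += loss
    base = min(totals)  # subtract the minimum for numerical stability
    weights = [math.exp(-eta * (t - base)) for t in totals]
    z = sum(weights)
    return [w / z for w in weights]


def simulate(true_losses, delay, eta):
    """Play exponential weights when round s is revealed only after `delay` rounds.

    Assumption: before it is revealed, round s is observed as all zeros, so
    the observation of round s changes once, from zeros to the true loss.
    """
    n_actions = len(true_losses[0])
    history = []
    for t in range(len(true_losses)):
        # Rebuild the observation of every past round as seen at round t.
        observations = [
            true_losses[s] if t - s > delay else [0.0] * n_actions
            for s in range(t)
        ]
        history.append(exp_weights(observations, n_actions, eta))
    return history


if __name__ == "__main__":
    losses = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    for t, p in enumerate(simulate(losses, delay=1, eta=1.0)):
        print(t, p)
```

Recomputing the distribution from the latest observations at every round, rather than accumulating updates incrementally, is a design choice of this sketch: it means a revised observation of an old round automatically replaces the stale one.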
