Online Learning with Bounded Recall

Jon Schneider, Kiran Vodrahalli

TL;DR

This work analyzes full-information online learning under bounded recall, where decisions rely only on the last $M$ rewards. It proves fundamental limits for bounded-recall learners, showing that naive mean-based windowing yields constant or suboptimal regret, and then constructs stationary bounded-recall algorithms with near-optimal $O\left(\sqrt{(\log d)/M}\right)$ per-round regret using AverageRestart and AverageRestartFullHorizon. A key insight is that asymmetry in how past rounds are weighed is essential; symmetric bounded-recall algorithms cannot achieve sublinear regret. Empirical results corroborate the theoretical findings, demonstrating improved performance of the proposed bounded-recall methods in drifting and non-stationary environments, with implications for privacy-preserving and streaming learning scenarios.

Abstract

We study the problem of full-information online learning in the "bounded recall" setting popular in the study of repeated games. An online learning algorithm $\mathcal{A}$ is $M$-$\textit{bounded-recall}$ if its output at time $t$ can be written as a function of the $M$ previous rewards (and not, e.g., any other internal state of $\mathcal{A}$). We first demonstrate that a natural approach to constructing bounded-recall algorithms from mean-based no-regret learning algorithms (e.g., running Hedge over the last $M$ rounds) fails, and that any such algorithm incurs constant regret per round. We then construct a stationary bounded-recall algorithm that achieves a per-round regret of $\Theta(1/\sqrt{M})$, which we complement with a tight lower bound. Finally, we show that, unlike in the perfect-recall setting, any low-regret bounded-recall algorithm must be aware of the ordering of the past $M$ losses: any bounded-recall algorithm which plays a symmetric function of the past $M$ losses must incur constant regret per round.
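To make the "Hedge over the last $M$ rounds" construction concrete, the following is a minimal sketch of that naive windowed learner, together with a standard per-round regret computation. It illustrates the approach the abstract says fails, not the paper's AverageRestart algorithms. The function names `windowed_hedge` and `per_round_regret`, the learning rate `eta`, and the uniform play on an empty window are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def windowed_hedge(window: Sequence[Sequence[float]], d: int, eta: float) -> List[float]:
    """Hedge distribution over d actions, computed only from `window`.

    `window` holds at most the last M reward vectors, so the output is a
    function of the M previous rewards and nothing else (bounded recall).
    The rule depends only on per-action sums, so it is symmetric in the
    order of the rewards inside the window.
    """
    totals = [0.0] * d
    for reward in window:
        for i in range(d):
            totals[i] += reward[i]
    top = max(totals)  # subtract the max before exponentiating for stability
    weights = [math.exp(eta * (s - top)) for s in totals]
    z = sum(weights)
    return [w / z for w in weights]


def per_round_regret(rewards: Sequence[Sequence[float]], M: int, eta: float) -> float:
    """Average regret of windowed Hedge against the best fixed action."""
    T = len(rewards)
    d = len(rewards[0])
    earned = 0.0
    for t in range(T):
        window = rewards[max(0, t - M):t]  # only the previous M rounds
        x = windowed_hedge(window, d, eta)
        earned += sum(x[i] * rewards[t][i] for i in range(d))
    best_fixed = max(sum(rewards[t][i] for t in range(T)) for i in range(d))
    return (best_fixed - earned) / T
```

Because `windowed_hedge` only looks at per-action sums, permuting the rewards inside the window leaves its output unchanged. This is the kind of symmetric rule that, according to the abstract, must incur constant per-round regret on some reward sequences.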

Paper Structure

This paper contains 21 sections, 11 theorems, 21 equations, 2 figures, and 4 algorithms.

Key Result

Theorem 3.1

Fix an $M > 0$. Then for any $M$-bounded-recall learning algorithm $\mathcal{A}$ and $T > M$, there exists a distribution $\mathcal{D}$ over online learning instances $\mathbf{r}$ of length $T$ with $d$ actions such that

Figures (2)

  • Figure 1: A plot of $\Delta_t$ over time, as used in Lemma \ref{lem:counterexample}.
  • Figure 2: (Left) Total regret of the algorithms over time, averaged uniformly over high-frequency drifting scenarios in which the mean reward of arm $1$ has period $T/20$, $T/10$, $T/5$, or $T/2$ and arm $2$ receives a reward in $\{\pm 1\}$ from an unbiased coin flip; the bounded-recall algorithms significantly outperform the classic no-regret algorithms. (Right) Total regret of the algorithms over time for one block of the adversarial rewards case (see the construction in Lemma \ref{lem:counterexample}); the mean-based bounded-recall learner attains regret on the order of $M/6$ (here, $M = T/3$), while our no-regret bounded-recall learners all outperform Multiplicative Weights.

Theorems & Definitions (28)

  • Definition 2.1: Per-round Regret
  • Definition 2.2: Bounded-Recall Online Learning Algorithms
  • Definition 2.3: Mean-based algorithm
  • Theorem 3.1: Lower Bound
  • proof
  • Theorem 3.2
  • proof
  • Corollary 3.3
  • proof
  • Theorem 4.1
  • ...and 18 more