Can a Learner Regret Using a No-Regret Algorithm? A Control-Theoretic Study of Performance Dominance

Hassan Abdelraouf, Jeff S. Shamma

TL;DR

The minimal achievable cumulative reward gap between a form of anticipatory replicator dynamics (RD) and standard RD is shown to be zero. This establishes global dominance of anticipatory RD across all payoff environments, and hence a "free lunch" among no-regret learning dynamics.

Abstract

No-regret learning dynamics ensure that a learner asymptotically achieves an average reward no worse than that of any fixed strategy. This no-regret guarantee does not determine the value of the asymptotic average reward. Indeed, two no-regret learning dynamics facing the same environment can exhibit different asymptotic average rewards while both satisfy the no-regret guarantee. This paper asks whether a "free-lunch" phenomenon can arise among no-regret algorithms. Namely, is it possible for one no-regret learning rule to uniformly outperform another no-regret learning rule across all payoff environments? Stated differently, can a learner regret not using a particular no-regret algorithm? We consider generalized replicator dynamics (RD) as a cascade interconnection between a linear time-invariant (LTI) system and the softmax nonlinearity. Varying this LTI system leads to different realizations of replicator dynamics, including so-called anticipatory RD, exponential RD, and other forms of higher-order RD. Setting the LTI system to be an integrator realizes standard RD, which is known to satisfy the no-regret property. Within this framework, we analyze and compare various realizations of generalized RD obtained by varying the LTI system. We first formulate performance comparison as a passivity property of an associated comparison system and establish "local" dominance results, i.e., results comparing asymptotic performance near an equilibrium payoff vector. We then cast performance comparison between a form of anticipatory RD and standard RD as an optimal-control problem. We show that the minimal achievable cumulative reward gap is zero, thereby establishing global dominance of anticipatory RD across all payoff environments and establishing a "free lunch" among no-regret learning dynamics.
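The cascade structure described in the abstract (payoff vector into an LTI system, LTI output into softmax) can be sketched numerically. The following is a minimal illustration, not the authors' implementation. Its assumptions are the state-space parameterization `(A, B, C, D)` of the LTI block, the forward-Euler discretization, the horizon and step size, and the two-action payoff environment, all chosen here for illustration. Only the integrator case (standard RD) is instantiated, since that is the one realization the abstract specifies.

```python
import numpy as np

def softmax(y):
    """Numerically stable softmax mapping a score vector to a mixed strategy."""
    e = np.exp(y - np.max(y))
    return e / e.sum()

def simulate_generalized_rd(A, B, C, D, payoff, T=100.0, dt=2e-3):
    """Forward-Euler simulation of the cascade: payoff -> LTI system -> softmax.

    Assumed state-space form of the LTI block (illustrative parameterization):
        dz/dt = A z + B p(t),    y = C z + D p(t)
    The strategy is x = softmax(y). Returns the time-averaged reward x(t)^T p(t)
    and the time-averaged payoff vector, whose i-th entry is the average reward
    of always playing pure strategy i.
    """
    z = np.zeros(A.shape[0])
    steps = int(round(T / dt))
    reward_sum = 0.0
    payoff_sum = np.zeros(C.shape[0])
    for k in range(steps):
        p = payoff(k * dt)
        x = softmax(C @ z + D @ p)       # current mixed strategy
        reward_sum += float(x @ p) * dt  # accumulate instantaneous reward
        payoff_sum += p * dt             # accumulate per-strategy payoff
        z = z + dt * (A @ z + B @ p)     # Euler step of the LTI state
    return reward_sum / T, payoff_sum / T

# Standard RD: the LTI block is a pure integrator (A = 0, B = I, C = I, D = 0).
n = 2
I, Z = np.eye(n), np.zeros((n, n))

# Hypothetical environment: one sinusoidal payoff and one constant payoff.
payoff = lambda t: np.array([np.sin(t), 0.5])

avg_reward, avg_payoff = simulate_generalized_rd(Z, I, I, Z, payoff)
average_regret = avg_payoff.max() - avg_reward  # gap to the best fixed pure strategy
print(avg_reward, avg_payoff, average_regret)
```

Other realizations of generalized RD would correspond to other choices of `(A, B, C, D)` in this sketch. The average regret printed at the end shrinks as the horizon `T` grows, which is the no-regret property for the integrator case. The value of `avg_reward` itself is what the paper's performance comparison is concerned with.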

Paper Structure (28 sections, 12 theorems, 148 equations, 8 figures)


Key Result

Lemma 1

For any $v\in\mathbb{R}^n$, with equality if and only if $v=c\,\mathbf{1}_n$ for some $c\in\mathbb{R}$.

Figures (8)

  • Figure 1: Block-diagram representation of replicator dynamics.
  • Figure 2: Block-diagram representation of exponential replicator dynamics (Ex-RD).
  • Figure 3: Performance of different learning dynamics in the environment $p(t)=[\sin t,\ 0.5]^\top$.
  • Figure 4: Performance of different learning dynamics in the environment $p(t)=[\sin t,\ -\sin t]^\top$.
  • Figure 5: Block diagram representation for the anticipatory RD.
  • ...and 3 more figures

Theorems & Definitions (30)

  • Remark 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Example 1: Finite-regret dominance [abdelraouf2025passivity]
  • Example 2
  • Definition 1: Uniform Dominance
  • Definition 2: Asymptotic Dominance
  • Remark 2
  • ...and 20 more