Learning from Delayed Feedback in Games via Extra Prediction

Yuma Fujimoto, Kenshi Abe, Kaito Ariu

TL;DR

This paper tackles the problem of time-delayed feedback in learning in games, showing that standard optimistic FTRL (OFTRL) loses both social regret optimality and convergence when delays occur. It introduces Weighted OFTRL (WOFTRL), which scales the optimistic term by a factor $n$, and proves that choosing $n=m+1$ (where $m$ is the delay) cancels the delay effects, yielding constant social regret and last-iterate convergence to Nash equilibria in poly-matrix zero-sum games. The authors provide RVU-based regret bounds, corollaries with $O(m^{2})$ regret under specific learning rates, and extensive experiments on Matching Pennies and Sato's game that validate the theory. They further show convergence guarantees in poly-matrix zero-sum settings with Legendre-type regularizers, including a last-iterate convergence result, and demonstrate these phenomena in Rock-Paper-Scissors experiments. Collectively, the work establishes a principled method to offset time-delay effects in multi-agent learning, with implications for delayed-feedback markets and multi-agent systems.

Abstract

This study raises and addresses the problem of time-delayed feedback in learning in games. Because learning in games assumes that multiple agents independently learn their strategies, a discrepancy in optimization often emerges among the agents. To overcome this discrepancy, a prediction of the future reward is incorporated into algorithms, typically known as Optimistic Follow-the-Regularized-Leader (OFTRL). However, a time delay in observing past rewards hinders the prediction. Indeed, this study first proves that even a single-step delay worsens the performance of OFTRL in terms of both social regret and convergence. This study proposes weighted OFTRL (WOFTRL), in which the prediction vector of the next reward in OFTRL is weighted $n$ times. We further capture the intuition that the optimistic weight cancels out the time delay. We prove that when the optimistic weight exceeds the time delay, WOFTRL recovers the good performance: social regret is constant in general-sum normal-form games, and the strategies converge in the last iterate to the Nash equilibrium in poly-matrix zero-sum games. The theoretical results are supported and strengthened by our experiments.
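To make the weighted update concrete, here is a minimal sketch of WOFTRL for a two-player zero-sum matrix game with the entropic regularizer, where the FTRL step reduces to a softmax. The update form used here (sum of the rewards observed up to the delay, plus $n$ times the latest observed reward) is an assumption inferred from the abstract, not the paper's exact Definition 2. The function name `woftrl_two_player` and the `bias1`/`bias2` initial scores are hypothetical; the biases only move the start away from the uniform equilibrium.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: the FTRL solution for the entropic regularizer.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def woftrl_two_player(A, T, eta, m, n, bias1, bias2):
    """Sketch of weighted OFTRL with delay m and optimistic weight n.

    A is player 1's payoff matrix; player 2 receives -A^T (zero-sum).
    At round t, only rewards from rounds 0..t-1-m are assumed observable.
    """
    k1, k2 = A.shape
    cum1, cum2 = np.zeros(k1), np.zeros(k2)    # sums of observed rewards
    pred1, pred2 = np.zeros(k1), np.zeros(k2)  # latest observed reward (prediction)
    hist1, hist2 = [], []                      # full reward history
    xs1, xs2 = np.empty((T, k1)), np.empty((T, k2))
    for t in range(T):
        last = t - 1 - m                       # newest reward visible at round t
        if last >= 0:
            pred1, pred2 = hist1[last], hist2[last]
            cum1 = cum1 + pred1
            cum2 = cum2 + pred2
        # Weighted optimistic step: the prediction is counted n extra times.
        x1 = softmax(eta * (bias1 + cum1 + n * pred1))
        x2 = softmax(eta * (bias2 + cum2 + n * pred2))
        xs1[t], xs2[t] = x1, x2
        hist1.append(A @ x2)                   # reward vector of player 1
        hist2.append(-A.T @ x1)                # reward vector of player 2
    return xs1, xs2

# Matching Pennies with delay m=2 and optimistic weight n=m+1=3.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
xs1, xs2 = woftrl_two_player(A, T=2000, eta=0.01, m=2, n=3,
                             bias1=np.array([50.0, 0.0]), bias2=np.zeros(2))
```

Setting `m=0, n=1` gives a standard OFTRL step under this assumed form, so the same function can be used to compare the delayed and delay-free cases.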

Paper Structure

This paper contains 37 sections, 10 theorems, 87 equations, 3 figures.

Key Result

Theorem 6

Suppose that the strategy space is unconstrained ($\boldsymbol{x}_{i}\in\mathbb{R}^{|\mathcal{A}_{i}|}$) in Example 5 (Matching Pennies) with the Euclidean regularizer ($h(\boldsymbol{x}_{i})=\|\boldsymbol{x}_{i}\|_{2}^{2}/2$). Then, when both players use OFTRL ($n=1$) with any time delay ($m\ge 1$), their soc

Figures (3)

  • Figure 1: Regret analysis for Matching Pennies (Example 5). A. The phase diagram of social regrets for various time delays $m$ (horizontal) and optimistic weights $n$ (vertical). The deep blue color indicates that the regret is small ($O(1)$-regret), while green and yellow indicate that the regret is large ($O(\sqrt{T})$-regret). A clear transition appears between $O(1)$- and $O(\sqrt{T})$-regret. We set the parameters as $T=10^{5}$ and $\eta=10^{-2}$. B. The convergence of social regrets for various optimistic weights $n$ and a fixed time delay $m=10$ (corresponding to the red broken line in Panel A). The regret oscillates and is relatively large for $n=1,3,5,7,9$ ($O(\sqrt{T})$-regret) but converges to a small value for $n=11,13,15$ ($O(1)$-regret). A transition is again clearly observed at $n=m$. We set the parameter as $\eta=10^{-2}$. C. The scaling of social regrets with $T$ in the case of $m=10$ and $n=11$ (corresponding to the red star in Panel A). We plot two choices of learning rate: $\eta=1/\sqrt{T}$ (blue dots) and $\eta=10^{-2}$, i.e., $\eta=O(1)$ (orange dots). The regrets for $\eta=O(1/\sqrt{T})$ follow the broken blue line, which has a slope of $1/2$ (meaning that the regrets are $O(\sqrt{T})$). On the other hand, the regrets for $\eta=O(1)$ follow the broken orange line, which has a slope of $0$ (meaning that the regrets are $O(1)$).
  • Figure 2: Regret analysis for Sato's game (Eqs. \ref{sato}) with the entropic regularizer, i.e., $h(\boldsymbol{x})=\left<\boldsymbol{x},\log\boldsymbol{x}\right>$. The results and parameter settings for all the panels are the same as those in Fig. 1. A. The phase diagram of social regrets for various time delays $m$ (horizontal) and optimistic weights $n$ (vertical). B. The convergence of social regrets for various optimistic weights $n$ and a fixed time delay $m=10$ (corresponding to the red broken line in Panel A). C. The scaling of social regrets in the case of $m=10$ and $n=11$ (corresponding to the red star in Panel A).
  • Figure 3: Convergence analysis for a weighted Rock-Paper-Scissors game with the entropic regularizer, i.e., $h(\boldsymbol{x}_{i})=\left<\boldsymbol{x}_{i},\log\boldsymbol{x}_{i}\right>$. We set the parameters as $\eta=10^{-1}$ and $m=4$. The colored lines are the learning trajectories. The black dot, black star, and white star indicate the initial state, final state, and Nash equilibrium, respectively. From left to right, the optimistic weights are $n=3,4,5,6$. In the left two panels ($n\le m$), the black star does not overlap the white one, meaning non-convergence. In the right two panels ($n>m$), the black star overlaps the white one, indicating last-iterate convergence.
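
The social regret plotted in Figures 1 and 2 can be computed from recorded trajectories along the following lines. This is a sketch that assumes the standard definition (each player's external regret against the best fixed strategy in hindsight, summed over players); the paper's exact definition is not reproduced on this page, and the function names `regret` and `social_regret` are hypothetical.

```python
import numpy as np

def regret(rewards, strategies):
    """External regret of one player.

    rewards:    array of shape (T, num_actions), reward vector at each round
    strategies: array of shape (T, num_actions), mixed strategy at each round
    """
    rewards = np.asarray(rewards, dtype=float)
    strategies = np.asarray(strategies, dtype=float)
    # A linear objective over the simplex is maximized at a pure action.
    best_fixed = rewards.sum(axis=0).max()
    realized = np.sum(rewards * strategies)
    return best_fixed - realized

def social_regret(rewards_per_player, strategies_per_player):
    # Assumed definition: the sum of the individual regrets of all players.
    return sum(regret(u, x)
               for u, x in zip(rewards_per_player, strategies_per_player))

# Tiny hand-checkable example: player 1 always plays the worse action.
u1 = [[1.0, 0.0], [1.0, 0.0]]
x1 = [[0.0, 1.0], [0.0, 1.0]]
u2 = [[0.0, 0.0], [0.0, 0.0]]
x2 = [[0.5, 0.5], [0.5, 0.5]]
total = social_regret([u1, u2], [x1, x2])
```

Evaluating this quantity for growing $T$ and fitting the slope on a log-log plot is one way to distinguish $O(1)$ from $O(\sqrt{T})$ scaling, as in Panel C of Figure 1.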

Theorems & Definitions (25)

  • Definition 1: Online learning with time delay
  • Definition 2: Generalized FTRL with time delay
  • Example 5: Matching Pennies
  • Theorem 6: Social regret of $\Omega(\sqrt{T})$ by time delay
  • Theorem 7: Divergence by time delay
  • Definition 9: RVU property
  • Theorem 10: RVU property when $n=m+1$
  • Corollary 11: Constant social regret with time delay
  • proof
  • Definition 12: Poly-matrix zero-sum games
  • ...and 15 more