Learning from Delayed Feedback in Games via Extra Prediction
Yuma Fujimoto, Kenshi Abe, Kaito Ariu
TL;DR
This paper tackles the problem of time-delayed feedback in learning in games, showing that standard optimistic FTRL (OFTRL) loses both its social-regret optimality and its convergence when feedback is delayed. It introduces Weighted OFTRL (WOFTRL), which scales the optimistic term by a factor $n$, and proves that choosing $n=m+1$ (where $m$ is the delay) cancels the effect of the delay, yielding constant social regret in general-sum normal-form games and last-iterate convergence to Nash equilibria in poly-matrix zero-sum games. The authors provide RVU-based regret bounds, corollaries giving $O(m^{2})$ regret under specific learning rates, and experiments on Matching Pennies and Sato's game that validate the theory. The convergence guarantees in the poly-matrix zero-sum setting are stated for Legendre-type regularizers and are illustrated in Rock-Paper-Scissors experiments. Collectively, the work establishes a principled method to offset time-delay effects in multi-agent learning, with implications for delayed-feedback markets and multi-agent systems.
Abstract
This study raises and addresses the problem of time-delayed feedback in learning in games. Because learning in games assumes that multiple agents learn their strategies independently, a discrepancy in optimization often emerges among the agents. To overcome this discrepancy, a prediction of the future reward is incorporated into the algorithm, an approach typically known as Optimistic Follow-the-Regularized-Leader (OFTRL). However, a time delay in observing past rewards hinders this prediction. Indeed, this study first proves that even a single-step delay worsens the performance of OFTRL in terms of both social regret and convergence. This study then proposes weighted OFTRL (WOFTRL), in which the prediction vector of the next reward in OFTRL is multiplied by a weight $n$. The underlying intuition is that the optimistic weight cancels out the time delay. We prove that when the optimistic weight exceeds the time delay, WOFTRL recovers the favorable guarantees of the delay-free setting: social regret is constant in general-sum normal-form games, and the strategies converge in the last iterate to the Nash equilibrium in poly-matrix zero-sum games. The theoretical results are supported and strengthened by our experiments.
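To make the role of the optimistic weight $n$ and the delay $m$ concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes an entropic regularizer (so each update reduces to a softmax), a two-player zero-sum matrix game, and an update in which each player adds $n$ copies of the most recently observed reward vector to the sum of all rewards observed so far. The function name `run_woftrl`, the payoff matrix, and the learning rate `eta` are illustrative choices that do not come from the paper.

```python
from collections import deque

import numpy as np


def softmax(z):
    """Numerically stable softmax, the FTRL solution under an entropic regularizer."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def run_woftrl(A, n, m, eta=0.05, steps=5000):
    """Simulate two players using a WOFTRL-style update with m-step delayed feedback.

    A     : payoff matrix of player 1 (player 2 receives -A), an assumed test game
    n     : optimistic weight on the most recently observed reward vector
    m     : number of steps by which reward observations are delayed
    Returns the trajectories of both players' mixed strategies.
    """
    k1, k2 = A.shape
    cum_x, cum_y = np.zeros(k1), np.zeros(k2)    # sums of rewards observed so far
    last_x, last_y = np.zeros(k1), np.zeros(k2)  # latest observed reward vectors
    pending = deque()                            # rewards generated but not yet seen
    xs, ys = np.empty((steps, k1)), np.empty((steps, k2))

    for t in range(steps):
        # Strategy = regularized best response to observed sum plus weighted prediction.
        x = softmax(eta * (cum_x + n * last_x))
        y = softmax(eta * (cum_y + n * last_y))
        xs[t], ys[t] = x, y

        # Reward vectors generated at time t; they are queued, not observed yet.
        pending.append((A @ y, -A.T @ x))

        # A reward becomes visible only once m further rewards are queued behind it.
        if len(pending) > m:
            u_x, u_y = pending.popleft()
            cum_x += u_x
            cum_y += u_y
            last_x, last_y = u_x, u_y

    return xs, ys


if __name__ == "__main__":
    # Biased Matching Pennies (an assumed example); its equilibrium is not uniform,
    # so the uniform starting point is away from equilibrium.
    A = np.array([[2.0, -1.0], [-1.0, 1.0]])
    m = 1
    for n in (1, m + 1):
        xs, ys = run_woftrl(A, n=n, m=m)
        print(f"n={n}, m={m}: final x={xs[-1]}, final y={ys[-1]}")
```

Running the script with different values of `n` for a fixed `m` gives a quick way to compare the plain optimistic weight ($n=1$) against the weight suggested by the analysis ($n=m+1$). The behavior observed will depend on `eta` and `steps`, so this sketch should be read as an illustration of the mechanism rather than a reproduction of the paper's experiments.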
