Scale-Invariant Regret Matching and Online Learning with Optimal Convergence: Bridging Theory and Practice in Zero-Sum Games
Brian Hu Zhang, Ioannis Anagnostides, Tuomas Sandholm
TL;DR
Addressing the long-standing gap between theory and practice in zero-sum online learning, the paper introduces a scale-invariant, parameter-free variant of regret matching (IREG-PRM$^+$) and an adaptive optimistic gradient descent (AdOGD) with RVU-type guarantees, achieving $O_T(1/T)$ average-iterate and $O_T(1/\sqrt{T})$ best-iterate convergence. It further develops IR-PRM and IR-PRM$^+$ with predictions and a nondecreasing regret norm, plus an extragradient variant (IREG-PRM$^+$ EG) that yields $O_T(1/T)$ equilibrium guarantees in zero-sum games, all while maintaining competitive performance in benchmarks. The work unifies regret-matching with gradient-based optimization, clarifying why RM-based methods perform well in practice and delivering parameter-free, scale-invariant algorithms with strong theoretical and empirical convergence guarantees. Overall, it closes the theory-practice gap in zero-sum game solving and provides practical, scalable tools for self-play and adversarial learning.
Abstract
A considerable chasm has been looming for decades between theory and practice in zero-sum game solving through first-order methods. Although a convergence rate of $T^{-1}$ has long been established since Nemirovski's mirror-prox algorithm and Nesterov's excessive gap technique in the early 2000s, the most effective paradigm in practice is *counterfactual regret minimization*, which is based on *regret matching* and its modern variants. In particular, the state of the art across most benchmarks is *predictive* regret matching$^+$ (PRM$^+$), in conjunction with non-uniform averaging. Yet, such algorithms can exhibit slower $\Omega(T^{-1/2})$ convergence even in self-play. In this paper, we close the gap between theory and practice. We propose a new scale-invariant and parameter-free variant of PRM$^+$, which we call IREG-PRM$^+$. We show that it achieves $T^{-1/2}$ best-iterate and $T^{-1}$ (i.e., optimal) average-iterate convergence guarantees, while also being on par with PRM$^+$ on benchmark games. From a technical standpoint, we draw an analogy between IREG-PRM$^+$ and optimistic gradient descent with *adaptive* learning rate. The basic flaw of PRM$^+$ is that the ($\ell_2$-)norm of the regret vector -- which can be thought of as the inverse of the learning rate -- can decrease. By contrast, we design IREG-PRM$^+$ so as to maintain the invariant that the norm of the regret vector is nondecreasing. This enables us to derive an RVU-type bound for IREG-PRM$^+$, the first such property that does not rely on introducing additional hyperparameters to enforce smoothness. Furthermore, we find that IREG-PRM$^+$ performs on par with an adaptive version of optimistic gradient descent that we introduce whose learning rate depends on the misprediction error, demystifying the effectiveness of the regret matching family *vis-a-vis* more standard optimization techniques.
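The claim that the regret-vector norm can decrease under PRM$^+$ can be illustrated with a single hand-picked step. The sketch below is not IREG-PRM$^+$ and is not taken from the paper: it assumes a common textbook form of PRM$^+$ in which the prediction lives in regret space, the strategy is proportional to the positive part of regret plus prediction, and the regret is thresholded at zero after each update. The function names and the numbers are hypothetical, chosen only to make the effect visible.

```python
import math


def positive_part(v):
    """Clip every coordinate at zero: [v]^+."""
    return [max(a, 0.0) for a in v]


def normalize(v):
    """Scale a nonnegative vector onto the simplex (uniform if it is all zeros)."""
    total = sum(v)
    n = len(v)
    return [a / total for a in v] if total > 0 else [1.0 / n] * n


def prm_plus_strategy(regret, prediction):
    """Assumed PRM+ play rule: x proportional to [R + m]^+."""
    return normalize(positive_part([r + m for r, m in zip(regret, prediction)]))


def prm_plus_update(regret, strategy, utility):
    """Assumed PRM+ regret update: R <- [R + (u - <x, u> 1)]^+."""
    value = sum(x * u for x, u in zip(strategy, utility))
    return positive_part([r + (u - value) for r, u in zip(regret, utility)])


def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))


# Hand-picked (hypothetical) two-action instance.
regret = [1.0, 0.0]        # current regret vector R
prediction = [-1.0, 1.0]   # prediction m of the next instantaneous regret
utility = [0.0, 1.0]       # utility vector u actually observed

strategy = prm_plus_strategy(regret, prediction)         # plays action 2 only
new_regret = prm_plus_update(regret, strategy, utility)  # regret collapses to zero

norm_before = l2_norm(regret)
norm_after = l2_norm(new_regret)
print(norm_before, norm_after)
```

Here the prediction moves the played strategy away from the direction of the regret vector, so the instantaneous regret is no longer orthogonal to it and the update shrinks the norm from 1 to 0. Reading the norm as an inverse learning rate, this corresponds to a step size that suddenly grows, which is the behavior the abstract identifies as the basic flaw of PRM$^+$.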
