Scale-Invariant Regret Matching and Online Learning with Optimal Convergence: Bridging Theory and Practice in Zero-Sum Games

Brian Hu Zhang, Ioannis Anagnostides, Tuomas Sandholm

TL;DR

Addressing the long-standing gap between theory and practice in zero-sum online learning, the paper introduces a scale-invariant, parameter-free variant of regret matching (IREG-PRM$^+$) and an adaptive optimistic gradient descent (AdOGD) with RVU-type guarantees, achieving $O_T(1/T)$ average-iterate and $O_T(1/\sqrt{T})$ best-iterate convergence. It further develops IR-PRM and IR-PRM$^+$ with predictions and a nondecreasing regret norm, plus an extragradient variant (IREG-PRM$^+$ EG) that yields $O_T(1/T)$ equilibrium guarantees in zero-sum games, all while maintaining competitive performance in benchmarks. The work unifies regret-matching with gradient-based optimization, clarifying why RM-based methods perform well in practice and delivering parameter-free, scale-invariant algorithms with strong theoretical and empirical convergence guarantees. Overall, it closes the theory-practice gap in zero-sum game solving and provides practical, scalable tools for self-play and adversarial learning.

Abstract

A considerable chasm has been looming for decades between theory and practice in zero-sum game solving through first-order methods. Although a convergence rate of $T^{-1}$ has long been established since Nemirovski's mirror-prox algorithm and Nesterov's excessive gap technique in the early 2000s, the most effective paradigm in practice is *counterfactual regret minimization*, which is based on *regret matching* and its modern variants. In particular, the state of the art across most benchmarks is *predictive* regret matching$^+$ (PRM$^+$), in conjunction with non-uniform averaging. Yet, such algorithms can exhibit slower $\Omega(T^{-1/2})$ convergence even in self-play. In this paper, we close the gap between theory and practice. We propose a new scale-invariant and parameter-free variant of PRM$^+$, which we call IREG-PRM$^+$. We show that it achieves $T^{-1/2}$ best-iterate and $T^{-1}$ (i.e., optimal) average-iterate convergence guarantees, while also being on par with PRM$^+$ on benchmark games. From a technical standpoint, we draw an analogy between IREG-PRM$^+$ and optimistic gradient descent with *adaptive* learning rate. The basic flaw of PRM$^+$ is that the ($\ell_2$-)norm of the regret vector -- which can be thought of as the inverse of the learning rate -- can decrease. By contrast, we design IREG-PRM$^+$ so as to maintain the invariance that the norm of the regret vector is nondecreasing. This enables us to derive an RVU-type bound for IREG-PRM$^+$, the first such property that does not rely on introducing additional hyperparameters to enforce smoothness. Furthermore, we find that IREG-PRM$^+$ performs on par with an adaptive version of optimistic gradient descent that we introduce whose learning rate depends on the misprediction error, demystifying the effectiveness of the regret matching family *vis-a-vis* more standard optimization techniques.
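To make the regret-vector picture in the abstract concrete, the following is a minimal sketch of plain regret matching$^+$ (without predictions) in self-play on a small matrix game. It is not the paper's IREG-PRM$^+$ and not PRM$^+$; the payoff matrix `A`, the helper names, and the choice of uniform averaging are illustrative assumptions. The sketch records the $\ell_2$ norm of one player's regret vector at every step and reports the Nash gap of the average strategies.

```python
import math

def nash_gap(A, x, y):
    """Duality gap of (x, y) in min_x max_y x^T A y: max_j (A^T x)_j - min_i (A y)_i."""
    n, m = len(A), len(A[0])
    Ay = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
    ATx = [sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
    return max(ATx) - min(Ay)

def strategy(R):
    """Play proportionally to the nonnegative regret vector; uniform if it is zero."""
    total = sum(R)
    return [r / total for r in R] if total > 0 else [1.0 / len(R)] * len(R)

def update(R, x, loss):
    """RM+ update: add the instantaneous regret <loss, x> - loss_i, then clip at zero."""
    expected = sum(xi * li for xi, li in zip(x, loss))
    return [max(Ri + expected - li, 0.0) for Ri, li in zip(R, loss)]

def self_play(A, T):
    """Simultaneous RM+ self-play with uniform averaging (illustrative choice)."""
    n, m = len(A), len(A[0])
    Rx, Ry = [0.0] * n, [0.0] * m
    x_sum, y_sum = [0.0] * n, [0.0] * m
    norms = []
    for _ in range(T):
        x, y = strategy(Rx), strategy(Ry)
        # The x-player minimizes x^T A y; the y-player minimizes its negation.
        loss_x = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
        loss_y = [-sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
        Rx, Ry = update(Rx, x, loss_x), update(Ry, y, loss_y)
        x_sum = [s + xi for s, xi in zip(x_sum, x)]
        y_sum = [s + yi for s, yi in zip(y_sum, y)]
        norms.append(math.sqrt(sum(r * r for r in Rx)))
    return [s / T for s in x_sum], [s / T for s in y_sum], norms

A = [[2.0, -1.0], [-1.0, 1.0]]  # hypothetical 2x2 payoff matrix
x_bar, y_bar, norms = self_play(A, 10_000)
gap = nash_gap(A, x_bar, y_bar)
print(f"Nash gap of average iterate: {gap:.4f}")
```

In this prediction-free sketch each strategy is proportional to the current regret vector, so the instantaneous regret is orthogonal to it and the recorded norm never shrinks. The abstract's concern is the predictive variant, where the strategy is computed from a predicted regret vector and this orthogonality no longer holds.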

Paper Structure

This paper contains 19 sections, 9 theorems, 48 equations, 2 figures, 4 algorithms.

Key Result

Proposition 2.1

Let $\bar{{\bm{x}}}^{(T)} := \frac{1}{T} \sum_{t=1}^T {\bm{x}}^{(t)}$ and $\bar{{\bm{y}}}^{(T)} := \frac{1}{T} \sum_{t=1}^T {\bm{y}}^{(t)}$. If the players have regret $\mathsf{Reg}_{\mathcal{X}}^{(T)}$ and $\mathsf{Reg}_{\mathcal{Y}}^{(T)}$ after $T$ repetitions of a zero-sum game, respectively, th

Figures (2)

  • Figure 1: $\texttt{IREG-PRM}^+$ and simultaneous $\texttt{PRM}^+$ on the counterexample game \ref{eq:counterexample}. In the left plot, the dark lines and light lines show the Nash gap of the last iterate and average iterate, respectively. In the middle and right plots, the dark lines show the actual regret, and the light lines show the $\ell_2$ norm of the regret vector.
  • Figure 2: Experimental results. The $x$-axis is the number of gradient evaluations (matrix-vector products with ${\mathbf{A}}$): alternating and simultaneous iterates use two gradient evaluations per iteration; extra-gradient uses four. $\texttt{DCFR}$ is not typically run with predictions, so we also do not use predictions when running $\texttt{DCFR}$, and thus "Extra-gradient $\texttt{DCFR}$" is not run. To avoid messy plots, the average iterate is only shown if it is better than the last iterate, and only the lower frontier of each curve is shown, that is, each curve plots the smallest Nash gap achieved up to that timestep.
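The plotting convention described in the Figure 2 caption can be sketched as follows, assuming a list of per-iteration Nash gaps is already available. The names `GRADS_PER_ITER`, `lower_frontier`, and `to_plot_points` are hypothetical and the gap values are made up for illustration; only the per-iteration gradient counts (two for alternating and simultaneous iterates, four for extra-gradient) come from the caption.

```python
from itertools import accumulate

# Gradient evaluations per iteration, as stated in the Figure 2 caption.
GRADS_PER_ITER = {"simultaneous": 2, "alternating": 2, "extragradient": 4}

def lower_frontier(gaps):
    """Smallest Nash gap achieved up to each timestep (running minimum)."""
    return list(accumulate(gaps, min))

def to_plot_points(gaps, variant):
    """Pair each running-minimum gap with the cumulative gradient-evaluation count."""
    per_iter = GRADS_PER_ITER[variant]
    return [((t + 1) * per_iter, g) for t, g in enumerate(lower_frontier(gaps))]

# Made-up gaps for four iterations of an extra-gradient method.
points = to_plot_points([0.5, 0.7, 0.2, 0.3], "extragradient")
print(points)
```

Placing gradient evaluations rather than iterations on the $x$-axis keeps the comparison fair, because an extra-gradient iteration costs twice as much as an alternating or simultaneous one.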

Theorems & Definitions (19)

  • Proposition 2.1
  • Definition 2.3: RVU bound [Syrgkanis15:Fast]
  • Theorem 3.1: RVU bound for $\texttt{AdOGD}$
  • Proof of \ref{theorem:RVU-adOGD}
  • Corollary 3.2
  • Proof of \ref{cor:opt-AdOGD}
  • Remark 3.3
  • Corollary 3.4: Bounded second-order path length for $\texttt{AdOGD}$
  • Proof of \ref{cor:pathlength}
  • Lemma 4.1
  • ...and 9 more