Meta-Learning in Self-Play Regret Minimization

David Sychrovský, Martin Schmid, Michal Šustr, Michael Bowling

TL;DR

The paper addresses how to accelerate equilibrium finding in two-player zero-sum games under self-play, particularly when the games are drawn from a distribution of related games. It extends offline meta-learning (learning-not-to-regret) to the self-play setting by introducing a meta-loss tailored to self-play that accounts for strategies across all decision states, enabling communication across infostates. It presents two neural meta-learners, NOA and NPCFR; the latter preserves regret guarantees by using a neural predictor within the predictive CFR framework. Empirically, these meta-learners outperform traditional regret-minimization baselines on normal-form games and river_poker subgames, with faster convergence and smoother trajectories. The results suggest that meta-learned regret minimizers in self-play can substantially speed up online equilibrium computation in large-scale domains and could improve search-based algorithms. The paper also highlights challenges in out-of-distribution generalization and avenues for tighter meta-loss formulations.

Abstract

Regret minimization is a general approach to online optimization which plays a crucial role in many algorithms for approximating Nash equilibria in two-player zero-sum games. The literature mainly focuses on solving individual games in isolation. However, in practice, players often encounter a distribution of similar but distinct games, for example when trading correlated assets on the stock market or when refining the strategy in subgames of a much larger game. Recently, offline meta-learning was used to accelerate one-sided equilibrium finding on such distributions. We build upon this, extending the framework to the more challenging self-play setting, which is the basis for most state-of-the-art equilibrium approximation algorithms for domains at scale. When selecting the strategy, our method uniquely integrates information across all decision states, promoting global communication as opposed to the traditional local regret decomposition. Empirical evaluation on normal-form games and river poker subgames shows our meta-learned algorithms considerably outperform other state-of-the-art regret minimization algorithms.
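
As a point of reference for the self-play setting described above, the following is a minimal sketch of plain regret matching run in self-play on a single normal-form zero-sum game, tracking the exploitability of the average strategy. It is not the paper's NOA or NPCFR: there is no neural network and no meta-learning here. The payoff matrix `A` (a biased rock-paper-scissors), the function names, and the step count are all assumptions chosen for illustration.

```python
def regret_matching(cum_regret):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    if total <= 0.0:
        # No positive regret yet: fall back to the uniform strategy.
        return [1.0 / len(cum_regret)] * len(cum_regret)
    return [p / total for p in positive]


def exploitability(A, x, y):
    """Mean best-response gain of both players against (x, y).

    Player 1 (rows) receives x^T A y, player 2 (columns) receives -x^T A y.
    """
    n, m = len(A), len(A[0])
    br1 = max(sum(A[i][j] * y[j] for j in range(m)) for i in range(n))
    br2 = max(-sum(A[i][j] * x[i] for i in range(n)) for j in range(m))
    return (br1 + br2) / 2.0


def self_play(A, steps):
    """Run regret matching for both players; return average strategies and curve."""
    n, m = len(A), len(A[0])
    R1, R2 = [0.0] * n, [0.0] * m        # cumulative regrets
    sum1, sum2 = [0.0] * n, [0.0] * m    # running sums of current strategies
    curve = []
    for t in range(1, steps + 1):
        x, y = regret_matching(R1), regret_matching(R2)
        # Per-action rewards given the opponent's current strategy.
        u1 = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
        u2 = [-sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
        v1 = sum(x[i] * u1[i] for i in range(n))
        v2 = sum(y[j] * u2[j] for j in range(m))
        # Instantaneous regret: action reward minus the reward actually obtained.
        R1 = [R1[i] + u1[i] - v1 for i in range(n)]
        R2 = [R2[j] + u2[j] - v2 for j in range(m)]
        sum1 = [sum1[i] + x[i] for i in range(n)]
        sum2 = [sum2[j] + y[j] for j in range(m)]
        x_avg = [s / t for s in sum1]
        y_avg = [s / t for s in sum2]
        curve.append(exploitability(A, x_avg, y_avg))
    return x_avg, y_avg, curve


# Hypothetical biased rock-paper-scissors; one payoff is perturbed.
A = [[0.0, -1.0, 2.0],
     [1.0, 0.0, -1.0],
     [-1.0, 1.0, 0.0]]

x_avg, y_avg, curve = self_play(A, steps=2000)
print("exploitability after 1 step:    ", curve[0])
print("exploitability after 2000 steps:", curve[-1])
```

The exploitability of the average strategy equals the sum of both players' average external regrets divided by two, so any regret minimizer that drives regret down faster also lowers this curve faster. That is the quantity the meta-learned algorithms are evaluated on.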

Paper Structure

This paper contains 27 sections, 14 equations, 12 figures, 1 table, 2 algorithms.

Figures (12)

  • Figure 1: Computational graphs of NOA$^{(+)}$ (left) and NPCFR$^{(+)}$ (right). The gradient $\partial \mathcal{L} / \partial \theta$ originates in the collection of maximal instantaneous regrets $\left\lVert \boldsymbol{r}^{1 \dots T}_i \right\rVert_\infty$ and propagates through the strategies $\bm{\sigma}^{1 \dots T}$ (the predictions $\boldsymbol{p}^{1 \dots T}_i$ for NPCFR$^{(+)}$), the rewards $\boldsymbol{x}^{1 \dots T}_i(\bm{\sigma}^{1 \dots T}(\theta))$ coming from the rest of the game, the cumulative regret $\boldsymbol{R}^{0 \dots T-1}_i$, and hidden states $\boldsymbol{h}^{0 \dots T-1}_i$.
  • Figure 2: Comparison of non-meta-learned algorithms (CFR$^{(+)}$, PCFR$^{(+)}$, DCFR, and SPCFR$^{+}$) with meta-learned algorithms (NOA$^{(+)}$ and NPCFR$^{(+)}$) on rock_paper_scissors (left) and river_poker (right). The figures show exploitability of the average strategy $\overline{\bm{\sigma}}^t$. Vertical dashed lines separate the training (up to $T=32$ steps) and the generalization (from $T$ to $2T$ steps) regimes. See Figure \ref{fig: results with errors} for standard errors.
  • Figure 3: Comparison of the convergence in average strategy $\overline{\bm{\sigma}}^t$ on a sample of rock_paper_scissors over $2T=64$ steps. The red crosses show the per-player equilibria of the sampled game. The quadrilaterals show the region of equilibria \cite{BokHla2015a}. We use blue for the first player and green for the second. The trajectories start in dark colors and get brighter for later steps. See Figure \ref{fig: policy space convergence} in Appendix \ref{app: additional results} for current strategy convergence.
  • Figure 4: Neural network architecture used for all meta-learned algorithms applied to a game with three infostates $\{s_k\}_{k=1}^3$. Same colors indicate shared parameters. See Appendix \ref{app: network architecture} for the full description of the network.
  • Figure 5: Comparison of the convergence in current strategy $\bm{\sigma}^t$ on a sample of rock_paper_scissors over $2T=64$ steps. The red crosses show the per-player equilibria of the sampled game. The trajectories start in dark colors and get brighter for later steps.
  • ...and 7 more figures

Theorems & Definitions (1)

  • Definition 1