Meta-Learning in Self-Play Regret Minimization
David Sychrovský, Martin Schmid, Michal Šustr, Michael Bowling
TL;DR
The paper addresses faster equilibrium finding in two-player zero-sum games under self-play, particularly for distributions of related games. It extends offline meta-learning (learning-not-to-regret) to the self-play setting by introducing a meta-loss tailored to self-play that accounts for strategies across all decision states, enabling communication across infostates. It presents two neural meta-learners, NOA and NPCFR; the latter preserves regret guarantees by using a neural predictor within the predictive CFR framework. Empirically, these meta-learners outperform traditional regret-minimization baselines on normal-form games and river poker subgames, with faster convergence and smoother trajectories. The results suggest that meta-learned regret minimization in self-play can substantially speed up online equilibrium computation in large-scale domains and could improve search-based algorithms. The paper also highlights challenges in out-of-distribution generalization and identifies tighter meta-loss formulations as an avenue for future work.
Abstract
Regret minimization is a general approach to online optimization which plays a crucial role in many algorithms for approximating Nash equilibria in two-player zero-sum games. The literature mainly focuses on solving individual games in isolation. However, in practice, players often encounter a distribution of similar but distinct games, for example when trading correlated assets on the stock market or when refining the strategy in subgames of a much larger game. Recently, offline meta-learning was used to accelerate one-sided equilibrium finding on such distributions. We build upon this, extending the framework to the more challenging self-play setting, which is the basis for most state-of-the-art equilibrium approximation algorithms for domains at scale. When selecting the strategy, our method uniquely integrates information across all decision states, promoting global communication as opposed to the traditional local regret decomposition. Empirical evaluation on normal-form games and river poker subgames shows our meta-learned algorithms considerably outperform other state-of-the-art regret minimization algorithms.
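To make the self-play setting concrete, the following is a minimal sketch of plain regret matching run by both players on a single normal-form zero-sum game. It illustrates the kind of non-meta-learned baseline the abstract refers to, not the paper's NOA or NPCFR algorithms. The payoff matrix, the function names, and the definition of exploitability as the sum of both players' best-response gains are assumptions chosen for illustration.

```python
# Illustrative sketch only: standard regret matching in self-play,
# not the meta-learned algorithms from the paper.

# Hypothetical 3x3 zero-sum game (row player's payoffs); chosen for illustration.
PAYOFF = [
    [0.0, -1.0, 2.0],
    [1.0, 0.0, -1.0],
    [-2.0, 1.0, 0.0],
]


def regret_matching(cum_regret):
    """Play proportionally to positive cumulative regret; uniform if none is positive."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(cum_regret)] * len(cum_regret)
    return [p / total for p in positive]


def self_play(payoff, steps):
    """Run regret matching for both players and return their average strategies."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    regret_x, regret_y = [0.0] * n_rows, [0.0] * n_cols
    sum_x, sum_y = [0.0] * n_rows, [0.0] * n_cols
    for _ in range(steps):
        x, y = regret_matching(regret_x), regret_matching(regret_y)
        # Expected utility of each pure action against the opponent's current strategy.
        u_x = [sum(payoff[i][j] * y[j] for j in range(n_cols)) for i in range(n_rows)]
        u_y = [-sum(x[i] * payoff[i][j] for i in range(n_rows)) for j in range(n_cols)]
        value = sum(x[i] * u_x[i] for i in range(n_rows))  # row player's expected payoff
        # Instantaneous regret: action utility minus the utility actually obtained.
        for i in range(n_rows):
            regret_x[i] += u_x[i] - value
            sum_x[i] += x[i]
        for j in range(n_cols):
            regret_y[j] += u_y[j] + value  # column player's obtained utility is -value
            sum_y[j] += y[j]
    return [s / steps for s in sum_x], [s / steps for s in sum_y]


def exploitability(payoff, x, y):
    """Sum of both players' best-response values against the given strategy pair."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    best_row = max(sum(payoff[i][j] * y[j] for j in range(n_cols)) for i in range(n_rows))
    best_col = max(-sum(x[i] * payoff[i][j] for i in range(n_rows)) for j in range(n_cols))
    return best_row + best_col


if __name__ == "__main__":
    avg_x, avg_y = self_play(PAYOFF, 10_000)
    print("average strategies:", avg_x, avg_y)
    print("exploitability:", exploitability(PAYOFF, avg_x, avg_y))
```

The average strategies, rather than the last iterates, are what approach an equilibrium here. A meta-learned regret minimizer of the kind the abstract describes would aim to reduce the number of steps needed across a distribution of such games.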
