Learning not to Regret

David Sychrovský; Michal Šustr; Elnaz Davoodi; Michael Bowling; Marc Lanctot; Martin Schmid

TL;DR

Addresses equilibrium finding when games are drawn from a distribution rather than fixed to a single instance. The paper proposes offline meta-learning of regret minimizers, culminating in Neural Predictive Regret Matching (NPRM), which uses a neural predictor within predictive regret matching to accelerate convergence on the target distribution while still guaranteeing $R^{\text{ext},T}=O(\sqrt{T})$ regret on arbitrary games. Empirically, the meta-learned algorithms (NOA and NPRM) substantially outperform the non-meta-learned baselines (RM and PRM), converging around an order of magnitude faster on river_poker and showing strong speedups in the matrix and sequential game tests. The approach enables faster decision-time search with value functions and demonstrates a practical path to domain-specific, robust equilibrium learning.

Abstract

The literature on game-theoretic equilibrium finding predominantly focuses on single games or their repeated play. Nevertheless, numerous real-world scenarios feature playing a game sampled from a distribution of similar, but not identical, games, such as playing poker with different public cards or trading correlated assets on the stock market. As these similar games feature similar equilibria, we investigate a way to accelerate equilibrium finding on such a distribution. We present a novel "learning not to regret" framework, enabling us to meta-learn a regret minimizer tailored to a specific distribution. Our key contribution, Neural Predictive Regret Matching, is uniquely meta-learned to converge rapidly for the chosen distribution of games, while having regret minimization guarantees on any game. We validated our algorithms' faster convergence on a distribution of river poker games. Our experiments show that the meta-learned algorithms outpace their non-meta-learned counterparts, achieving more than tenfold improvements.

Paper Structure (20 sections, 1 theorem, 9 equations, 7 figures, 1 table, 1 algorithm)

Introduction
Prior Work
Background
Learning not to Regret
Meta-Learning Framework
Neural Online Algorithm
Neural Predictive Regret Matching
Experiments
Matrix Games
Sequential Games
Computational Time Reduction
Out of Distribution Convergence
Additional Experiments
Conclusion
Acknowledgements
...and 5 more sections

Key Result

Theorem 1

Let $\alpha~\ge~0$, and $\pi_\theta$ be a regret predictor with outputs bounded in $[-\alpha, \alpha]^{|A|}$. Then PRM which uses $\pi_\theta$ is a regret minimizer.
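
To make the statement concrete, here is a minimal sketch of predictive regret matching (PRM) in self-play on a two-player zero-sum matrix game. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, and `bounded_predictor` is a simple stand-in (it predicts the last observed regret, clipped to $[-\alpha, \alpha]$) for the neural predictor $\pi_\theta$. The clipping mirrors the boundedness condition of the theorem.

```python
import numpy as np


def prm_strategy(cum_regret, prediction):
    """Play proportionally to the positive part of (cumulative regret + prediction)."""
    positive = np.maximum(cum_regret + prediction, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    # No action has positive predicted regret: fall back to the uniform strategy.
    return np.full(len(cum_regret), 1.0 / len(cum_regret))


def bounded_predictor(last_regret, alpha):
    """Hypothetical stand-in for pi_theta: last regret, clipped to [-alpha, alpha]."""
    return np.clip(last_regret, -alpha, alpha)


def run_prm(payoff, steps, alpha=1.0):
    """Self-play PRM; the row player maximizes `payoff`, the column player minimizes it.

    Returns the average strategies of both players.
    """
    payoff = np.asarray(payoff, dtype=float)
    n_rows, n_cols = payoff.shape
    cum_row, cum_col = np.zeros(n_rows), np.zeros(n_cols)
    last_row, last_col = np.zeros(n_rows), np.zeros(n_cols)
    avg_row, avg_col = np.zeros(n_rows), np.zeros(n_cols)

    for _ in range(steps):
        s_row = prm_strategy(cum_row, bounded_predictor(last_row, alpha))
        s_col = prm_strategy(cum_col, bounded_predictor(last_col, alpha))

        # Reward vectors: expected payoff of each action against the opponent.
        x_row = payoff @ s_col
        x_col = -(payoff.T @ s_row)

        # Instantaneous regret of each action relative to the played strategy.
        last_row = x_row - s_row @ x_row
        last_col = x_col - s_col @ x_col
        cum_row += last_row
        cum_col += last_col

        avg_row += s_row
        avg_col += s_col

    return avg_row / steps, avg_col / steps


def exploitability(payoff, s_row, s_col):
    """Sum of both players' best-response gains (some definitions average instead)."""
    payoff = np.asarray(payoff, dtype=float)
    return (payoff @ s_col).max() - (payoff.T @ s_row).min()
```

Calling `run_prm` on a small payoff matrix and passing the returned average strategies to `exploitability` gives a quantity that should shrink as `steps` grows, since the average strategies of regret minimizers in self-play approach an equilibrium of a zero-sum game. Replacing `bounded_predictor` with a learned model whose outputs are clipped the same way keeps the structure of the loop unchanged.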

Figures (7)

Figure 1: The sequence of strategies $\{ \bm{\sigma} ^t\}_{t=1}^T$ submitted by an online algorithm and the rewards $\{ \bm{x} ^t\}_{t=1}^T$ received from the environment. The reward $\bm{x} ^0 = \mathbf{0}$ initializes the algorithms to produce the first strategy $\bm{\sigma} ^1$.
Figure 2: Computational graphs of the proposed algorithms. The gradient flows only along the solid edges. The $\bm{h}$ denotes the hidden state of the neural network. See also Figure 1 for visual correspondence of the strategy and reward sequence.
Figure 3: Comparison of non-meta-learned algorithms (RM, PRM) with meta-learned algorithms (NOA, NPRM), on a small matrix game and a large sequential game and for a single fixed game versus a whole distribution over games. The figures show exploitability of the average strategy $\overline{ \bm{\sigma} }^t$. The y-axis uses a logarithmic scale. Vertical dashed lines separate two regimes: training (up to $T$ steps) and generalization (from $T$ to $2T$ steps). Colored areas show standard error for the sampled settings.
Figure 4: For each algorithm, we show the trajectories of current strategies $\bm{\sigma} ^t$ (top row) and average strategies $\overline{ \bm{\sigma} }^t$ (bottom row) on rock_paper_scissors (sampled) for $2T=128$ steps. The red cross shows the equilibrium of the sampled game. The trajectories start in dark colors and get brighter for later steps. The blue polygon is the set of all equilibria in the distribution rock_paper_scissors (sampled), computed according to BokHla2015a. Notice how the strategies of our meta-learned algorithms begin in the polygon and refine their strategy to reach the current equilibrium. In contrast, (P)RM are initialized with the uniform strategy and visit a large portion of the policy space.
Figure 5: Comparison of regret minimization algorithms as a function of wall time, rather than the number of steps shown in Figure 3.
...and 2 more figures

Theorems & Definitions (2)

Theorem 1: Correctness of Neural-Predicting
proof
