Approximating Nash Equilibria in Normal-Form Games via Stochastic Optimization

Ian Gemp, Luke Marris, Georgios Piliouras

TL;DR

This work tackles the computational bottleneck of finding approximate Nash equilibria in $n$-player, general-sum normal-form games by recasting NE computation as a stochastic non-convex optimization problem. The authors introduce a novel loss $\mathcal{L}^{\tau}(\boldsymbol{x})$ that admits unbiased Monte Carlo estimation and is Lipschitz and bounded, enabling efficient optimization with SGD and bandit-based methods. They establish a theoretical connection between the loss and exploitability, extend to entropy-regularized (quantal response) equilibria, and derive convergence guarantees via X-armed bandits and StoSOO under the condition of polymatrix-isolated equilibria; their experiments show SGD can outperform prior baselines in some settings. The results open a scalable route to approximate equilibria in large, multi-agent systems and suggest future work on extensive-form games and broader optimization techniques.

Abstract

We propose the first loss function for approximate Nash equilibria of normal-form games that is amenable to unbiased Monte Carlo estimation. This construction allows us to deploy standard non-convex stochastic optimization techniques for approximating Nash equilibria, resulting in novel algorithms with provable guarantees. We complement our theoretical analysis with experiments demonstrating that stochastic gradient descent can outperform previous state-of-the-art approaches.
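What "amenable to unbiased Monte Carlo estimation" buys can be illustrated with a small sketch. It assumes, as a simplification not stated in this summary, that the quantity being estimated is the squared norm of an expected (projected) payoff gradient. The matching-pennies payoffs, the opponent strategy `y`, and all function names are hypothetical. Squaring a single sampled gradient overestimates the squared norm of the expected gradient, whereas the inner product of two independently sampled gradients has exactly the right expectation. Expectations are computed by enumeration rather than by sampling, so the comparison is exact.

```python
from itertools import product

# Hypothetical example: row player's payoffs in matching pennies.
A = [[1.0, -1.0],
     [-1.0, 1.0]]
y = [0.5, 0.5]  # assumed mixed strategy of the opponent


def column(matrix, a):
    """Row player's payoff gradient when the opponent plays pure action a."""
    return [row[a] for row in matrix]


def project(g):
    """Project onto the simplex tangent space by subtracting the mean (assumed form)."""
    mean = sum(g) / len(g)
    return [gi - mean for gi in g]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def true_sq_norm(matrix, y):
    """Squared norm of the projected *expected* gradient (the target quantity)."""
    expected = [sum(row[a] * y[a] for a in range(len(y))) for row in matrix]
    p = project(expected)
    return dot(p, p)


def expected_two_sample(matrix, y):
    """Exact expectation of <P g(a), P g(b)> with a, b drawn independently from y."""
    n = len(y)
    return sum(
        y[a] * y[b] * dot(project(column(matrix, a)), project(column(matrix, b)))
        for a, b in product(range(n), repeat=2)
    )


def expected_one_sample(matrix, y):
    """Exact expectation of ||P g(a)||^2 from a single sample (biased upward)."""
    return sum(
        y[a] * dot(project(column(matrix, a)), project(column(matrix, a)))
        for a in range(len(y))
    )


print(true_sq_norm(A, y))         # 0.0
print(expected_two_sample(A, y))  # 0.0, matches the target
print(expected_one_sample(A, y))  # 2.0, biased by the sampling variance
```

The gap between the last two numbers is the variance of the sampled gradient, which is why a naive plug-in estimate would not vanish at an equilibrium in this toy setting.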

Paper Structure

This paper contains 48 sections, 52 theorems, 134 equations, 6 figures, and 3 tables.

Key Result

Lemma 1

Assuming player $k$'s utility, $u_k(x_k, x_{-k})$, is concave in its own strategy $x_k$, a strategy in the interior of the simplex is a best response $\texttt{BR}_k$ if and only if it has zero projected gradient. (The projected gradient here is not to be confused with the nonlinear, i.e., bias-introducing, projected gradient operator.)
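A minimal numerical check of this lemma, under stated assumptions: the utility is bilinear (hence concave in the player's own strategy), and the projected gradient is taken to be the gradient with its mean subtracted, i.e. its projection onto the tangent space of the simplex. The payoff matrix, the strategies, and the function names below are hypothetical.

```python
def expected_gradient(payoff, opponent):
    """Gradient of x^T (payoff @ opponent) with respect to x."""
    return [sum(row[a] * opponent[a] for a in range(len(opponent))) for row in payoff]


def projected_gradient(g):
    """Assumed projection onto the simplex tangent space: subtract the mean."""
    mean = sum(g) / len(g)
    return [gi - mean for gi in g]


def norm(v):
    return sum(vi * vi for vi in v) ** 0.5


def is_best_response(payoff, x, opponent, tol=1e-9):
    """x is a best response if no pure action earns more than x does."""
    g = expected_gradient(payoff, opponent)
    value = sum(xi * gi for xi, gi in zip(x, g))
    return max(g) - value <= tol


payoff = [[3.0, 0.0],
          [0.0, 1.0]]   # hypothetical payoffs for player k
x_interior = [0.3, 0.7]  # a strategy in the interior of the simplex

# Opponent mix that makes player k indifferent: zero projected gradient.
y_indifferent = [0.25, 0.75]
pg_zero = projected_gradient(expected_gradient(payoff, y_indifferent))

# Opponent mix that favors action 0: nonzero projected gradient.
y_uniform = [0.5, 0.5]
pg_nonzero = projected_gradient(expected_gradient(payoff, y_uniform))

print(norm(pg_zero), is_best_response(payoff, x_interior, y_indifferent))
print(norm(pg_nonzero), is_best_response(payoff, x_interior, y_uniform))
```

In the first case every action earns the same payoff, so the interior strategy is a best response and the projected gradient vanishes. In the second case action 0 is strictly better, so no interior strategy is a best response and the projected gradient is nonzero.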

Figures (6)

  • Figure 1: Effect of Sampled Play on a Biased Loss. The first row displays the expectation of the upper bound guaranteed by our proposed loss $\mathcal{L}^{\tau}$ with $\eta_k=1$ for all $k$. The second row displays the expectation of NashConv under sampled play, i.e., $\sum_k \epsilon_k$ where $\epsilon_k = \mathbb{E}_{a_{-k} \sim x_{-k}}[\max_{a_k} u_k^{\tau}(\boldsymbol{a})] - \mathbb{E}_{\boldsymbol{a} \sim \boldsymbol{x}}[u_k^{\tau}(\boldsymbol{a})]$. To be consistent, we subtract the offset $\tau \log(m^2)$ from $f_{\tau}(\mathcal{L}^{\tau})$ per the QRE-to-NE lemma, which relates the exploitability at positive temperature to that at zero temperature. The resulting loss surface clearly shows NashConv fails to recognize any interior Nash equilibrium due to its inherent bias.
  • Figure 2: Analysis of Loss Landscape. We reapply the analysis of Dauphin et al. (2014), originally designed to understand the success of SGD in deep learning, to "slices" of several popular extensive-form games. To construct a slice (or meta-game), we randomly sample $6$ deterministic policies and then consider the corresponding $n$-player, $6$-action normal-form game at $\tau=0.1$ (with payoffs normalized to $[0, 1]$). The index of a critical point $\boldsymbol{x}_c$ ($\nabla_{\boldsymbol{x}} \mathcal{L}^{\tau}(\boldsymbol{x}_c) = \mathbf{0}$) indicates the fraction of negative eigenvalues in the Hessian of $\mathcal{L}^{\tau}$ at $\boldsymbol{x}_c$; $\alpha=0$ indicates a local minimum, $1$ a maximum, else a saddle point. We see a positive correlation between exploitability ($y$-axis), projected-gradient norm ($x$-axis), and $\alpha$ (color) indicating a lower prevalence of local minima at high exploitability.
  • Figure 3: Comparison of SGD on $\mathcal{L}^{\tau=0}$ against baselines on four games evaluated in Gemp et al. (2022). The number of samples used to estimate each update iteration (i.e., minibatch size) is indicated by $s$. From left to right: $2$-player, $3$-action, nonsymmetric; $6$-player, $5$-action, nonsymmetric; $4$-player, $66$-action, symmetric; $3$-player, $286$-action, symmetric. SGD struggles at saddle points in Blotto.
  • Figure 4: Bandit-based (BLiN) Nash solver applied to an artificial $7$-player, symmetric, $2$-action game. We search for a symmetric equilibrium, which is represented succinctly as the probability of selecting action $1$. The plot shows the true exploitability $\epsilon$ of all symmetric strategies in black and indicates there exist potentially $5$ NEs (the dips in the curve). Upper bounds on our unregularized loss $\mathcal{L}$ capture $4$ of these equilibria, missing only the pure NE on the right. By considering our regularized loss, $\mathcal{L}^{\tau}$, we are able to capture this pure NE (see zoomed inset). The bandit algorithm selects strategies to evaluate, using $10$ Monte-Carlo samples for each evaluation (arm pull) of $\mathcal{L}^{\tau}$. These samples are displayed as vertical bars above with the height of the vertical bar representing additional arm pulls. The best arms throughout search are denoted by green circles (darker indicates later in the search). The boxed numbers near equilibria display the welfare of the strategy profile.
  • Figure 5: Upper Bound ($\epsilon \le f_{\tau}(\mathcal{L}^{\tau})$) Heatmap Visualization. The first row examines the loss landscape for the classic anti-coordination game of Chicken (Nash equilibria: $(0,1), (1,0), (2/3, 1/3)$) while the second row examines the Prisoner's Dilemma (unique Nash equilibrium: $(0,0)$). For improved visibility, we subtract the offset $\tau \log(m^2)$ from $f_{\tau}(\mathcal{L}^{\tau})$ per the QRE-to-NE lemma, which relates the exploitability at positive temperature to that at zero temperature. Temperature increases for each plot moving to the right. For high temperatures, interior (fully-mixed) strategies are incentivized while for lower temperatures, nearly pure strategies can achieve minimum exploitability. For zero temperature, pure strategy equilibria (e.g., defect-defect) are not captured by the loss as illustrated by the bottom-left Prisoner's Dilemma plot with a constant loss surface.
  • ...and 1 more figure
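The bias attributed to NashConv under sampled play in Figure 1 can be reproduced by exact enumeration in a tiny game. The sketch below is illustrative only: it assumes zero temperature ($\tau = 0$), uses hypothetical matching-pennies payoffs, and introduces its own function names. It compares the true gap $\max_{a_k} \mathbb{E}_{a_{-k}}[u_k] - \mathbb{E}[u_k]$ with the sampled-play gap $\mathbb{E}_{a_{-k}}[\max_{a_k} u_k] - \mathbb{E}[u_k]$ from the caption.

```python
def expected_utility(payoff, x, y):
    """E_{a ~ x, b ~ y}[u(a, b)] for the row player."""
    return sum(x[a] * y[b] * payoff[a][b]
               for a in range(len(x)) for b in range(len(y)))


def exact_gap(payoff, x, y):
    """True exploitability term: max over actions of the *expected* payoff."""
    best = max(sum(y[b] * payoff[a][b] for b in range(len(y)))
               for a in range(len(x)))
    return best - expected_utility(payoff, x, y)


def sampled_play_gap(payoff, x, y):
    """Sampled-play term: expectation of the max over actions per sampled b."""
    best = sum(y[b] * max(payoff[a][b] for a in range(len(x)))
               for b in range(len(y)))
    return best - expected_utility(payoff, x, y)


# Hypothetical example: matching pennies, row player's payoffs.
payoff = [[1.0, -1.0],
          [-1.0, 1.0]]
x = [0.5, 0.5]  # row player's strategy at the interior Nash equilibrium
y = [0.5, 0.5]  # column player's strategy at the interior Nash equilibrium

print(exact_gap(payoff, x, y))         # 0.0: the profile is an equilibrium
print(sampled_play_gap(payoff, x, y))  # 1.0: the max moved inside the expectation
```

Because the maximum sits inside the expectation, the sampled-play quantity stays strictly positive at this interior equilibrium, which is consistent with the caption's remark that NashConv under sampled play fails to recognize interior Nash equilibria.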

Theorems & Definitions (112)

  • Definition 1: Polymatrix-Isolated Equilibrium
  • Lemma 1
  • proof
  • Proposition 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • ...and 102 more