The Danger Of Arrogance: Welfare Equilibria As A Solution To Stackelberg Self-Play In Non-Coincidental Games

Jake Levi, Chris Lu, Timon Willi, Christian Schroeder de Witt, Jakob Foerster

TL;DR

The paper tackles learning in general-sum multi-agent systems, where opponents are non-stationary and incentives may be misaligned, with a focus on self-play. It shows that many recent approaches to general-sum learning, including existing Opponent Shaping methods, can be derived as approximations to Stackelberg strategies. It then defines non-coincidental games as those in which the Stackelberg strategy profile is not a Nash Equilibrium, which explains why existing algorithms fail in self-play in such games. Welfare Equilibria (WE) generalise Stackelberg strategies and can recover desirable Nash Equilibria even in non-coincidental games. A practical algorithm, WelFuSe, adaptively selects welfare functions to find desirable WE against unknown opponents, and the new algorithms SaGa and SaSa illustrate the framework's flexibility. The result is more mutually desirable outcomes in self-play while preserving performance against naive learning opponents.

Abstract

The increasing prevalence of multi-agent learning systems in society necessitates understanding how to learn effective and safe policies in general-sum multi-agent environments against a variety of opponents, including self-play. General-sum learning is difficult because of non-stationary opponents and misaligned incentives. Our first main contribution is to show that many recent approaches to general-sum learning can be derived as approximations to Stackelberg strategies, which suggests a framework for developing new multi-agent learning algorithms. We then define non-coincidental games as games in which the Stackelberg strategy profile is not a Nash Equilibrium. This notably includes several canonical matrix games and provides a normative theory for why existing algorithms fail in self-play in such games. We address this problem by introducing Welfare Equilibria (WE) as a generalisation of Stackelberg Strategies, which can recover desirable Nash Equilibria even in non-coincidental games. Finally, we introduce Welfare Function Search (WelFuSe) as a practical approach to finding desirable WE against unknown opponents, which finds more mutually desirable solutions in self-play, while preserving performance against naive learning opponents.
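
The definition of a non-coincidental game can be illustrated with a small sketch. It makes several assumptions that are not from the paper: strategies are restricted to pure actions, the follower best-responds exactly, ties are broken by the lowest action index, and the payoff matrices for Chicken and the one-shot Prisoner's Dilemma are illustrative values. The helper names (`leader_action`, `is_pure_nash`, `is_non_coincidental`) are hypothetical.

```python
import numpy as np

def leader_action(R_leader, R_follower):
    """Pure Stackelberg leader action (simplified, assumed setting).

    Both arrays are indexed [leader_action, follower_action].
    """
    # Follower's best response to each possible leader action.
    best_responses = R_follower.argmax(axis=1)
    # Leader's payoff for each action, given the follower best-responds.
    leader_values = R_leader[np.arange(R_leader.shape[0]), best_responses]
    return int(leader_values.argmax())

def is_pure_nash(R_x, R_y, a, b):
    """True if neither player gains by unilaterally deviating from (a, b)."""
    x_ok = R_x[a, b] >= R_x[:, b].max()
    y_ok = R_y[a, b] >= R_y[a, :].max()
    return bool(x_ok and y_ok)

def is_non_coincidental(R_x, R_y):
    """Both players act as Stackelberg leaders; check if the result is a NE.

    R_x and R_y are indexed [x_action, y_action].
    """
    a = leader_action(R_x, R_y)      # x leads, y follows
    b = leader_action(R_y.T, R_x.T)  # y leads, x follows
    return not is_pure_nash(R_x, R_y, a, b)

# Assumed payoffs. Chicken: action 0 = swerve, 1 = straight.
CHICKEN_X = np.array([[0.0, -1.0], [1.0, -100.0]])
CHICKEN_Y = CHICKEN_X.T
# One-shot Prisoner's Dilemma: action 0 = cooperate, 1 = defect.
PD_X = np.array([[-1.0, -3.0], [0.0, -2.0]])
PD_Y = PD_X.T

print(is_non_coincidental(CHICKEN_X, CHICKEN_Y))  # True
print(is_non_coincidental(PD_X, PD_Y))            # False
```

Under these assumptions, each Chicken player's leader action is to go straight, so two leaders meeting in self-play both go straight and crash, which is not a Nash Equilibrium. In the one-shot Prisoner's Dilemma both leaders defect, which is a Nash Equilibrium, so that game is coincidental in this simplified sense.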

Paper Structure (15 sections, 10 equations, 23 figures, 2 tables)

Figures (23)

  • Figure 1: Self-play phase portraits for different algorithms in the Impossible Market. Gradient arrow-lengths are normalised for visual clarity and do not represent gradient magnitude.
  • Figure 2: Self-play phase portraits for different algorithms in the Stag Hunt game. Axes refer to probability of hunting a stag for each player respectively, with the upper right corner corresponding to the optimal NE.
  • Figure 3: Results for 5000 steps of learning with SaSa ($\eta=0.1,M=20,N=20,\sigma=1$) against NL ($\eta=0.1$) in IPD (using discounted returns with $\gamma = 0.96$), averaged over 100 trials. Final mean rewards are -0.72972 and -1.97547 for SaSa and NL respectively.
  • Figure 4: Comparing ELOLA ($\eta=0.1, \alpha=25$, averaged over 100 trials) with WelFuSeElola (same ELOLA hyperparameters, $e=3, s=1000, b=100$, averaged over five random seeds) against different opponents in Chicken Game. Top row: against NL. Bottom row: self-play. Left column: rewards for ELOLA. Centre column: rewards for WelFuSeElola. Right column: welfare functions chosen by WelFuSeElola in each episode.
  • Figure 5: Stackelberg strategy profile (Greedy WE) for the Impossible Market: $x^* = 0.000$, $y^* = 0.000$, $R^x = -0.000$, $R^y = -0.000$.
  • ...and 18 more figures