Mastering Zero-Shot Interactions in Cooperative and Competitive Simultaneous Games

Yannik Mahlau, Frederik Schubert, Bodo Rosenhahn

TL;DR

The paper addresses zero-shot interaction in simultaneous multi-agent games by introducing the Smooth Best Response Logit Equilibrium (SBRLE) and the Albatross framework, which learns to approximate SBRLE through a two-stage training regime (proxy and response models) conditioned on opponent rationality via a temperature parameter $\tau$. By combining planning with self-play, Albatross adapts to agents of varying strength, enabling cooperative behavior with unknown partners and exploiting weak opponents in competitive settings. Empirical results show state-of-the-art performance in cooperative Overcooked (approximately 37.6% improvement over prior work) and superior exploitation of weaker agents in Battlesnake, while online estimation of opponents’ rationality sheds light on the dynamics of zero-shot interactions. This approach advances human-AI collaboration by modeling bounded rationality and provides a scalable framework for zero-shot coordination and competition in complex multi-agent environments, with publicly available code for reproducibility.

Abstract

The combination of self-play and planning has achieved great successes in sequential games, for instance in Chess and Go. However, adapting algorithms such as AlphaZero to simultaneous games poses a new challenge. In these games, missing information about concurrent actions of other agents is a limiting factor as they may select different Nash equilibria or do not play optimally at all. Thus, it is vital to model the behavior of the other agents when interacting with them in simultaneous games. To this end, we propose Albatross: AlphaZero for Learning Bounded-rational Agents and Temperature-based Response Optimization using Simulated Self-play. Albatross learns to play the novel equilibrium concept of a Smooth Best Response Logit Equilibrium (SBRLE), which enables cooperation and competition with agents of any playing strength. We perform an extensive evaluation of Albatross on a set of cooperative and competitive simultaneous perfect-information games. In contrast to AlphaZero, Albatross is able to exploit weak agents in the competitive game of Battlesnake. Additionally, it yields an improvement of 37.6% compared to previous state of the art in the cooperative Overcooked benchmark.

Paper Structure (32 sections, 2 theorems, 16 equations, 23 figures, 4 tables, 4 algorithms)

This paper contains 32 sections, 2 theorems, 16 equations, 23 figures, 4 tables, 4 algorithms.

Key Result

Theorem 5.1

The transformed utilities $\tilde{u}_i(\pi)$, defined as $\tilde{u}_i(\pi) = u_i(\pi) + \frac{1}{\tau} \psi(\pi_i)$ with the Shannon entropy $\psi(\pi_i) = -\sum_{a_i \in A_i} \pi_i(a_i) \log(\pi_i(a_i))$ as a smoothing function, are maximized by the softmax function $\pi_i = \mathit{SBR}(\pi_{-i}, \tau)$.
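The statement of Theorem 5.1 can be checked numerically on a toy example. The sketch below is not the authors' code: the function names and the utility vector `q` are hypothetical, and it assumes that `q` already holds player i's expected utility for each action against a fixed $\pi_{-i}$. It follows the convention used elsewhere on this page that $\tau = 0$ gives uniform random play and larger $\tau$ approaches a best response, so the softmax scales utilities by $\tau$ and the entropy term is weighted by $1/\tau$.

```python
import math
import random


def smooth_best_response(q_values, tau):
    """Softmax over expected utilities scaled by the temperature tau."""
    shift = max(tau * q for q in q_values)  # subtract the max for numerical stability
    weights = [math.exp(tau * q - shift) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def regularized_utility(policy, q_values, tau):
    """Expected utility plus (1 / tau) times the Shannon entropy of the policy."""
    expected = sum(p * q for p, q in zip(policy, q_values))
    entropy = -sum(p * math.log(p) for p in policy if p > 0)
    return expected + entropy / tau


def random_policy(n, rng):
    """Sample a random point on the probability simplex (illustration only)."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


# Hypothetical expected utilities of three actions against a fixed pi_{-i}.
q = [1.0, 0.5, -0.2]
tau = 2.0

sbr = smooth_best_response(q, tau)
best = regularized_utility(sbr, q, tau)

# No randomly sampled policy should score higher than the softmax policy.
rng = random.Random(0)
best_random = max(
    regularized_utility(random_policy(len(q), rng), q, tau) for _ in range(1000)
)

# The maximum has the closed form (1 / tau) * log(sum_a exp(tau * q_a)).
closed_form = math.log(sum(math.exp(tau * x) for x in q)) / tau
```

Under these assumptions, `best` matches `closed_form` up to floating-point error and is never exceeded by `best_random`, which is what the theorem predicts for the entropy-smoothed objective.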

Figures (23)

  • Figure 1: TrueSkill scores of a tournament consisting of an Albatross agent, Monte-Carlo-Tree-Search (MCTS) baseline agents and an AlphaZero baseline. Each game takes place in a free-for-all setting of four agents in the stochastic simultaneous game of Battlesnake. Albatross estimates the temperature, i.e., the rationality, of the baseline agents online using only data from the current game. A temperature of 0 corresponds to random play and 10 to optimal play if all other agents play optimally as well. AlphaZero achieves optimal play given that all other agents play optimally (temperature of 10), but fails to adapt to subrational agents. In contrast, Albatross is able to respond optimally against any combination of weak and strong agents due to its rationality estimation, resulting in a higher TrueSkill tournament score.
  • Figure 2: Visualization of equilibria in a zero-sum NFG. Assuming that player 2 plays a best response (BR) to the policy of player 1, the expected utility is lower than under the assumption of a subrational smooth best response $\mathit{SBR}(\cdot, \tau = 0.3)$. The dotted gray lines denote the expected utility of playing actions a or b against an SBR respectively. The NE maximizes the expected utility assuming player 2 plays a BR, while the QSE maximizes under the assumption of an SBR. The SBRLE starts with response temperature $\tau_R$ at a uniform distribution over actions a and b, and ends with $\tau_R \rightarrow \infty$ at the BRLE. The SBRLE is equal to the Logit equilibrium (LE) if the response temperature $\tau_R$ is equal to the temperature $\tau$ of the LE.
  • Figure 3: Training architecture of the proxy and response models of Albatross. Both models are trained via planning-augmented self-play using fixed-depth search and are conditioned on one (proxy model) or multiple (response model) temperatures $\tau$ that are drawn from a distribution $p(\tau)$. The response model uses the trained proxy model to compute the Smooth Best Response Logit Equilibrium (SBRLE).
  • Figure 4: Albatross agent (right) and a possibly weak agent (left) in the Asymmetric Advantage layout of Overcooked. If the left agent plays rationally, they should realize that they have a shorter path to their serving location (gray tile). They would move down, retrieve a dish and deliver the soup. Having a strong estimation of rationality (e.g. $\tau = 10$), Albatross trusts them to deliver the soup and moves up to pick up an onion and prepare the next soup in the other pot. If Albatross has an estimation of weak rationality for the left agent (e.g. $\tau = 0$), then Albatross moves down to retrieve a dish, collects the soup and serves it themselves.
  • Figure 5: Cooperation performance with a behavior cloning agent trained on a dataset of human play in all five layouts of Overcooked. Episodes last 400 time steps and agents receive a common reward of 20 for delivering a soup.
  • ...and 18 more figures
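The Figure 1 caption mentions that Albatross estimates an agent's temperature online from the current game. The captions do not say how this is done, so the following is only a minimal sketch of one plausible approach: a grid-search maximum-likelihood fit of a logit (softmax) policy to observed actions. The function names, the grid, and the toy observations are all hypothetical, and the utilities per action are assumed to be given.

```python
import math


def log_softmax(q_values, tau):
    """Log-probabilities of a logit policy with temperature tau."""
    scaled = [tau * q for q in q_values]
    shift = max(scaled)
    log_norm = shift + math.log(sum(math.exp(s - shift) for s in scaled))
    return [s - log_norm for s in scaled]


def estimate_temperature(observations, grid):
    """Pick the grid temperature that maximizes the log-likelihood.

    observations: list of (q_values, chosen_action_index) pairs, where
    q_values are assumed utilities of each action in that state.
    """
    def log_likelihood(tau):
        return sum(log_softmax(q, tau)[a] for q, a in observations)

    return max(grid, key=log_likelihood)


# Candidate temperatures from 0 (random play) to 10, matching the range in Figure 1.
grid = [0.5 * k for k in range(21)]

# Hypothetical agent that always picks the highest-utility action.
greedy_obs = [([1.0, 0.0, -1.0], 0)] * 20

# Hypothetical agent whose choices are spread evenly over all actions.
uniform_obs = [([1.0, 0.0, -1.0], a) for a in [0, 1, 2] * 4]

tau_greedy = estimate_temperature(greedy_obs, grid)
tau_uniform = estimate_temperature(uniform_obs, grid)
```

In this toy setup the always-greedy agent is assigned the largest temperature on the grid and the evenly spread agent is assigned 0, which mirrors the caption's reading of high temperatures as strong play and 0 as random play.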

Theorems & Definitions (4)

  • Theorem 5.1
  • proof
  • Theorem 6.1
  • proof