Human-Level Performance in No-Press Diplomacy via Equilibrium Search

Jonathan Gray, Adam Lerer, Anton Bakhtin, Noam Brown

TL;DR

The paper addresses no-press Diplomacy, a multi-agent game that mixes cooperation and competition, by combining a blueprint policy learned by imitation of human play with one-step lookahead search via regret minimization (RM). The approach yields human-level performance and strong robustness to exploitation, demonstrated through anonymous games against humans and cross-agent benchmarks. Key contributions include a refined supervised blueprint trained on extensive human data, an efficient RM-based equilibrium search for the current turn, and exploitability analyses supporting practical viability. The results suggest regret minimization can scale to complex domains that involve both cooperation and competition, and point to future work on deeper search and integration with reinforcement learning.

Abstract

Prior AI breakthroughs in complex games have focused on either the purely adversarial or purely cooperative settings. In contrast, Diplomacy is a game of shifting alliances that involves both cooperation and competition. For this reason, Diplomacy has proven to be a formidable research challenge. In this paper we describe an agent for the no-press variant of Diplomacy that combines supervised learning on human data with one-step lookahead search via regret minimization. Regret minimization techniques have been behind previous AI successes in adversarial games, most notably poker, but have not previously been shown to be successful in large-scale games involving cooperation. We show that our agent greatly exceeds the performance of past no-press Diplomacy bots, is unexploitable by expert humans, and ranks in the top 2% of human players when playing anonymous games on a popular Diplomacy website.
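To make the regret-minimization idea concrete, here is a minimal sketch of plain regret matching on a small two-player zero-sum matrix game. It is an illustration under stated assumptions, not the paper's implementation: the agent described above works on a seven-player game with sampled candidate actions, and the payoff matrix, function names, and iteration count below are all hypothetical. Exploitability is computed as the sum over players of the gain available from best-responding.

```python
import numpy as np

def current_strategy(regrets):
    """Play in proportion to positive cumulative regret; uniform if none is positive."""
    positive = np.maximum(regrets, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(regrets), 1.0 / len(regrets))

def action_values(payoffs, strategies):
    """Expected utility of each pure action against the opponent's mixed strategy."""
    return [payoffs[0] @ strategies[1], strategies[0] @ payoffs[1]]

def regret_matching(payoffs, iterations=10000):
    """Run regret matching for both players; return their average strategies.

    payoffs[p] is an (n0, n1) array holding player p's utility (hypothetical toy game).
    """
    n0, n1 = payoffs[0].shape
    regrets = [np.zeros(n0), np.zeros(n1)]
    strategy_sums = [np.zeros(n0), np.zeros(n1)]
    for _ in range(iterations):
        strategies = [current_strategy(r) for r in regrets]
        values = action_values(payoffs, strategies)
        for p in range(2):
            expected = strategies[p] @ values[p]
            # Regret of an action: how much better it would have done than the mix played.
            regrets[p] += values[p] - expected
            strategy_sums[p] += strategies[p]
    return [s / iterations for s in strategy_sums]

def exploitability(payoffs, strategies):
    """Sum over players of (best-response value minus value of the current strategy)."""
    values = action_values(payoffs, strategies)
    return sum(values[p].max() - strategies[p] @ values[p] for p in range(2))

if __name__ == "__main__":
    # Toy zero-sum game chosen so that the equilibrium is not uniform.
    row_payoff = np.array([[2.0, -1.0], [-1.0, 1.0]])
    payoffs = [row_payoff, -row_payoff]
    average = regret_matching(payoffs)
    uniform = [np.full(2, 0.5), np.full(2, 0.5)]
    print("uniform exploitability:", exploitability(payoffs, uniform))
    print("RM average exploitability:", exploitability(payoffs, average))
```

In two-player zero-sum games the average strategy produced by regret matching approaches an equilibrium, so its exploitability shrinks as the iteration count grows. That guarantee does not carry over to general-sum games with more players such as Diplomacy, which is why the abstract frames success there as an empirical finding.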

Paper Structure

This paper contains 24 sections, 6 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Left: Score of SearchBot using different numbers of sampled subgame actions, against six DipNet agents (Paquette et al., 2019, at temperature 0.1). A score of 14.3% (one seventh, since Diplomacy has seven players) would be a tie. Even when sampling only two actions, SearchBot dramatically outperforms our blueprint, which achieves a score of 20.2%. Middle: The effect of the number of iterations of sampled regret matching on SearchBot performance. Right: The effect of different rollout lengths on SearchBot performance.
  • Figure 2: Score of the exploiting agent against the blueprint and SearchBot-clone as a function of training time. We report the average of six runs. The shaded area corresponds to three standard errors. We use temperature 0.5 for both agents as it minimizes exploitability for the blueprint. Since SearchBot-clone is trained through imitation learning of SearchBot, the exploitability of SearchBot is almost certainly lower than SearchBot-clone.
  • Figure 3: Left: Distance of the RM average strategy from equilibrium as a function of the RM iteration, computed as the sum of all agents' exploitability in the matrix game in which RM is employed. RM reduces exploitability, while the blueprint policy has only slightly lower exploitability than the uniform distribution over the 50 sampled actions used in RM (i.e. RM iteration 1). For comparison, our human evaluations used 256 to 2048 RM iterations, depending on the time per turn. Right: Comparison of convergence of individual strategies to the average of two independently computed strategies. The similarity of these curves suggests that independent RM computations lead to compatible equilibria. Note: In both figures, exploitability is averaged over all phases in 28 simulated games; per-phase results are provided in the appendix.
  • Figure 4: Architecture of the model used for imitation learning in no-press Diplomacy.
  • Figure 5: Features used for the board state encoding.
  • ...and 3 more figures

Theorems & Definitions (1)

  • proof