Equivariant Networks for Zero-Shot Coordination

Darius Muglich, Christian Schroeder de Witt, Elise van der Pol, Shimon Whiteson, Jakob Foerster

TL;DR

This work tackles zero-shot coordination in Dec-POMDPs by enforcing symmetry through an equivariant network, eliminating the symmetry-breaking conventions that arise when independently trained agents with partial observability are paired. The core idea is a scalable symmetrizer $S$ that projects any network into the $G$-equivariant subspace, ensuring policy outputs transform consistently under environmental symmetries via $\mathbf{K}_g \psi(\mathbf{x}) = \psi(\mathbf{L}_g \mathbf{x})$ for every $g \in G$. The authors prove that $S(\psi)$ lies in the equivariant class and that the entire equivariant space is reachable as the image of $S$, enabling a test-time coordination-improvement operator that works with any self-play policy. Empirically on Hanabi, EQC yields state-of-the-art zero-shot coordination, improves cross-play for a diverse set of policies, and remains effective when the symmetries used form only a subgroup of the full automorphism group. The work offers a scalable, principled path to robust multi-agent coordination by embedding symmetry into the model rather than relying on data augmentation.
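The equivariance condition above can be illustrated with a small numerical sketch. It assumes the symmetrizer takes the common group-averaging form $S(\psi)(\mathbf{x}) = \frac{1}{|G|}\sum_{g \in G} \mathbf{K}_g^{-1}\psi(\mathbf{L}_g \mathbf{x})$, which this summary does not spell out, and uses the cyclic group $C_5$ acting by coordinate shifts on both inputs and outputs. The network `psi`, its weights, and the helper names are hypothetical and are not the authors' implementation.

```python
import numpy as np

def cyclic_perms(n):
    """Permutation matrices of C_n acting on R^n by cyclic coordinate shifts."""
    eye = np.eye(n)
    return [np.roll(eye, k, axis=0) for k in range(n)]

def symmetrize(psi, L, K):
    """Assumed group-averaging form: x -> (1/|G|) * sum_g K_g^{-1} psi(L_g x)."""
    def psi_sym(x):
        # Permutation matrices are orthogonal, so K_g^{-1} equals K_g^T.
        return sum(Kg.T @ psi(Lg @ x) for Lg, Kg in zip(L, K)) / len(L)
    return psi_sym

rng = np.random.default_rng(0)
n = 5
W1 = rng.normal(size=(16, n))   # hypothetical weights of a toy two-layer network
W2 = rng.normal(size=(n, 16))

def psi(x):
    return W2 @ np.tanh(W1 @ x)

L = cyclic_perms(n)  # input representation L_g
K = cyclic_perms(n)  # output representation K_g (same group element at each index)
psi_eq = symmetrize(psi, L, K)

x = rng.normal(size=n)
# Largest violation of K_g psi(x) = psi(L_g x) over the group, after and before symmetrizing.
max_err_sym = max(np.abs(Kg @ psi_eq(x) - psi_eq(Lg @ x)).max() for Lg, Kg in zip(L, K))
max_err_raw = max(np.abs(Kg @ psi(x) - psi(Lg @ x)).max() for Lg, Kg in zip(L, K))
print(max_err_sym, max_err_raw)
```

Because the wrapper only needs forward passes of `psi`, the same construction can in principle be applied to an already trained policy at test time, at the cost of $|G|$ forward passes per input.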

Abstract

Successful coordination in Dec-POMDPs requires agents to adopt robust strategies and interpretable styles of play for their partner. A common failure mode is symmetry breaking, in which agents arbitrarily converge on one out of many equivalent but mutually incompatible policies. Such cases commonly involve partial observability, e.g. waving your right hand vs. left hand to convey a covert message. In this paper, we present a novel equivariant network architecture for use in Dec-POMDPs that leverages environmental symmetry to improve zero-shot coordination, doing so more effectively than prior methods. Our method also acts as a "coordination-improvement operator" for generic, pre-trained policies, and thus may be applied at test time in conjunction with any self-play algorithm. We provide theoretical guarantees for our method and test it on the AI benchmark task of Hanabi, where we demonstrate that it outperforms other symmetry-aware baselines in zero-shot coordination and improves the coordination ability of a variety of pre-trained policies. In particular, we show our method can be used to improve on the state of the art for zero-shot coordination on the Hanabi benchmark.

Paper Structure

This paper contains 25 sections, 5 theorems, 11 equations, 5 figures, 3 tables.

Key Result

Proposition 1

(Symmetric Property) $S(\psi) \in \mathbf{\Psi}_\text{equiv},$ for all $\psi \in \mathbf{\Psi}$; that is, $S$ maps neural networks to equivariant neural networks.
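A short derivation shows why a statement of this kind holds. It assumes the symmetrizer is the group average $S(\psi)(\mathbf{x}) = \frac{1}{|G|}\sum_{g \in G} \mathbf{K}_g^{-1}\,\psi(\mathbf{L}_g \mathbf{x})$ (this summary does not give the formula) and that $\mathbf{L}_g$ and $\mathbf{K}_g$ are linear group representations, so that $\mathbf{L}_g \mathbf{L}_h = \mathbf{L}_{gh}$ and $\mathbf{K}_g \mathbf{K}_h = \mathbf{K}_{gh}$. For any $h \in G$, substituting $g' = gh$ gives:

$$
\begin{aligned}
S(\psi)(\mathbf{L}_h \mathbf{x})
&= \frac{1}{|G|}\sum_{g \in G} \mathbf{K}_g^{-1}\,\psi(\mathbf{L}_g \mathbf{L}_h \mathbf{x})
 = \frac{1}{|G|}\sum_{g \in G} \mathbf{K}_g^{-1}\,\psi(\mathbf{L}_{gh} \mathbf{x}) \\
&= \frac{1}{|G|}\sum_{g' \in G} \mathbf{K}_{g' h^{-1}}^{-1}\,\psi(\mathbf{L}_{g'} \mathbf{x})
 = \mathbf{K}_h \, \frac{1}{|G|}\sum_{g' \in G} \mathbf{K}_{g'}^{-1}\,\psi(\mathbf{L}_{g'} \mathbf{x})
 = \mathbf{K}_h \, S(\psi)(\mathbf{x}).
\end{aligned}
$$

The third equality uses the fact that $g \mapsto gh$ is a bijection of $G$, and the fourth uses $\mathbf{K}_{g' h^{-1}}^{-1} = (\mathbf{K}_{g'} \mathbf{K}_h^{-1})^{-1} = \mathbf{K}_h \mathbf{K}_{g'}^{-1}$ together with linearity of $\mathbf{K}_h$. Under these assumptions, $S(\psi)$ satisfies $\mathbf{K}_h\, S(\psi)(\mathbf{x}) = S(\psi)(\mathbf{L}_h \mathbf{x})$ for every $h \in G$ and every network $\psi$.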

Figures (5)

  • Figure 1: Illustrating different kinds of symmetry-robust policies, where the symmetries in this example are colors. Policies make actions based on the number of circles and the color. Invariant policies act irrespective of the change in color (i.e. only the number of circles matters), while equivariant policies act in correspondence with the change in color (i.e. changing the color will cause a corresponding change to the action).
  • Figure 2: Conditional action matrices of IQL, i.e. $P(a_t^i \mid a_{t-1}^j)$, unsymmetrized (left) and symmetrized at test time (middle is $C_5$-symmetrized and right is $D_{10}$-symmetrized). The y-axis represents the action taken at timestep $t$ and the x-axis shows the proportion of each action as response at timestep $t+1$. The matrices show the interactions between color/rank hinting and play/discarding. C1-5 and R1-5 mean hinting the 5 different colors and ranks respectively, and P1-5 and D1-5 mean playing and discarding the 1st-5th cards in the hand. We selected a random agent, and each plot is computed by running 1000 episodes of self-play with that agent.
  • Figure 3: Conditional action matrices of SAD, i.e. $P(a_t^i \mid a_{t-1}^j)$, unsymmetrized (left) and symmetrized at test time (middle is $C_5$-symmetrized and right is $D_{10}$-symmetrized). The y-axis represents the action taken at timestep $t$ and the x-axis shows the proportion of each action as response at timestep $t+1$. The matrices show the interactions between color/rank hinting and play/discarding. C1-5 and R1-5 mean hinting the 5 different colors and ranks respectively, and P1-5 and D1-5 mean playing and discarding the 1st-5th cards in the hand. We selected a random agent, and each plot is computed by running 1000 episodes of self-play with that agent.
  • Figure 4: Conditional action matrices of OP, i.e. $P(a_t^i \mid a_{t-1}^j)$, unsymmetrized (left) and symmetrized at test time (middle is $C_5$-symmetrized and right is $D_{10}$-symmetrized). The y-axis represents the action taken at timestep $t$ and the x-axis shows the proportion of each action as response at timestep $t+1$. The matrices show the interactions between color/rank hinting and play/discarding. C1-5 and R1-5 mean hinting the 5 different colors and ranks respectively, and P1-5 and D1-5 mean playing and discarding the 1st-5th cards in the hand. We selected a random agent, and each plot is computed by running 1000 episodes of self-play with that agent.
  • Figure 5: Conditional action matrices of $G$-equivariant agents, i.e. $P(a_t^i \mid a_{t-1}^j)$, $C_5$ on left and $D_{10}$ on right. The y-axis represents the action taken at timestep $t$ and the x-axis shows the proportion of each action as response at timestep $t+1$. The matrices show the interactions between color/rank hinting and play/discarding. C1-5 and R1-5 mean hinting the 5 different colors and ranks respectively, and P1-5 and D1-5 mean playing and discarding the 1st-5th cards in the hand. We selected a random $C_5$-equivariant agent and a random $D_{10}$-equivariant agent, and each plot is computed by running 1000 episodes of self-play with each agent.
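The conditional action matrices in Figures 2-5 can be estimated from logged self-play episodes by counting consecutive action pairs and normalizing each row. The sketch below is a minimal illustration and not the authors' plotting code. It assumes each episode is a list of integer action indices in turn order, so that in a two-player game consecutive entries come from alternating players; the function name and the toy data are hypothetical.

```python
import numpy as np

def conditional_action_matrix(episodes, num_actions):
    """Estimate P(next action | previous action) from lists of action indices."""
    counts = np.zeros((num_actions, num_actions))
    for actions in episodes:
        # Pair each action with the one taken on the following turn.
        for prev, nxt in zip(actions[:-1], actions[1:]):
            counts[prev, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observations stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Toy data with 3 actions; the figures use 20 (C1-5, R1-5, P1-5, D1-5).
episodes = [[0, 1, 0, 1], [0, 2, 0, 1]]
P = conditional_action_matrix(episodes, num_actions=3)
print(P)
```

Row `i` of `P` is the empirical distribution over responses to action `i`, which is the quantity shown along one row of each heatmap.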

Theorems & Definitions (8)

  • Proposition 1
  • Proposition 2
  • Proposition
  • proof
  • Proposition
  • proof
  • Proposition
  • proof