Expected Return Symmetries

Darius Muglich, Johannes Forkel, Elise van der Pol, Jakob Foerster

TL;DR

The paper tackles zero-shot coordination in Dec-POMDPs by addressing coordination failure from symmetry differences. It introduces expected return symmetries ($\Phi^{\mathrm{ER}}$), a broad group of policy-space transformations that preserve self-play optimality for Boltzmann-exploratory policies, and shows this group contains environment symmetries ($\Phi^{\mathrm{MDP}}$) as a subgroup. A model-free gradient-based method learns a compact, diverse subset of ER symmetries to plug into the Other-Play objective, improving cross-play performance without ground-truth symmetry information. Across Iterated Three-Lever, Cat/Dog, Overcooked V2, and Hanabi, OP with ER symmetries significantly mitigates over-coordination and enhances zero-shot coordination compared to both self-play and Dec-POMDP-symmetry baselines. The work highlights practical, scalable symmetry discovery via environment interaction, while noting limitations in expressivity (restricted to bijections on actions/observations) and dependence on the chosen policy pool for ER learning.

Abstract

Symmetry is an important inductive bias that can improve model robustness and generalization across many deep learning domains. In multi-agent settings, a priori known symmetries have been shown to address a fundamental coordination failure mode known as mutually incompatible symmetry breaking; e.g., in a game where two independent agents can choose to move "left" or "right", and where a reward of +1 or -1 is received when the agents choose the same action or different actions, respectively. However, the efficient and automatic discovery of environment symmetries, in particular for decentralized partially observable Markov decision processes, remains an open problem. Furthermore, environmental symmetry breaking constitutes only one type of coordination failure, which motivates the search for a more accessible and broader symmetry class. In this paper, we introduce such a broader group of previously unexplored symmetries, which we call expected return symmetries and which contains environment symmetries as a subgroup. We show that agents trained to be compatible under the group of expected return symmetries achieve better zero-shot coordination results than those using environment symmetries. As an additional benefit, our method makes minimal a priori assumptions about the structure of the environment and does not require access to ground truth symmetries.
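
The left/right game from the abstract can be worked through numerically. The sketch below is illustrative only: the function names are hypothetical, each policy is reduced to a single probability of playing "left", and the Other-Play value is assumed to be a uniform average over the two label relabelings (identity and left/right swap) applied to the partner's policy.

```python
def expected_return(p_left_1: float, p_left_2: float) -> float:
    """Expected reward when agent i plays "left" with probability p_left_i.

    Reward is +1 if both agents pick the same action, -1 otherwise.
    """
    p_same = p_left_1 * p_left_2 + (1 - p_left_1) * (1 - p_left_2)
    return p_same * 1.0 + (1 - p_same) * -1.0


def other_play_return(p_left_1: float, p_left_2: float) -> float:
    """Average return when the partner's labels are relabeled at random.

    Assumption: the only symmetries are the identity and the left/right
    swap, and both are weighted equally.
    """
    relabeled_partners = [p_left_2, 1 - p_left_2]  # identity, swap
    returns = [expected_return(p_left_1, q) for q in relabeled_partners]
    return sum(returns) / len(returns)


always_left, always_right = 1.0, 0.0

# Each convention is optimal in self-play...
print(expected_return(always_left, always_left))    # 1.0
print(expected_return(always_right, always_right))  # 1.0
# ...but two independently trained agents may break the symmetry differently.
print(expected_return(always_left, always_right))   # -1.0
# Under the relabeling average, no arbitrary convention earns credit.
print(other_play_return(always_left, always_left))  # 0.0
```

In this toy case the two self-play optima are mutually incompatible, which is the failure mode the abstract describes; averaging over relabelings removes the incentive to commit to either arbitrary label.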

Paper Structure

This paper contains 25 sections, 4 theorems, 62 equations, 1 figure, 2 tables, 3 algorithms.

Key Result

Theorem 1

$\Phi^{\mathrm{ER}} := \{ \phi \in \Psi \;|\; \phi(\Pi^\alpha_*) = \Pi^\alpha_* \}$ forms a group under function composition.
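
A rough sketch of why the group axioms plausibly hold is given below. It is not the paper's proof: it assumes that $\Psi$ is itself a group of bijections on policy space under composition, which this summary does not state, and it only checks that the defining condition $\phi(\Pi^\alpha_*) = \Pi^\alpha_*$ survives the group operations.

```latex
% Assumption: \Psi is a group of bijections under composition.
\begin{align*}
\text{Closure:}\quad
  & (\phi_1 \circ \phi_2)(\Pi^\alpha_*)
    = \phi_1\big(\phi_2(\Pi^\alpha_*)\big)
    = \phi_1(\Pi^\alpha_*)
    = \Pi^\alpha_*
    && \text{for } \phi_1, \phi_2 \in \Phi^{\mathrm{ER}}, \\
\text{Identity:}\quad
  & \mathrm{id}(\Pi^\alpha_*) = \Pi^\alpha_*, \\
\text{Inverse:}\quad
  & \phi^{-1}(\Pi^\alpha_*)
    = \phi^{-1}\big(\phi(\Pi^\alpha_*)\big)
    = \Pi^\alpha_*
    && \text{for } \phi \in \Phi^{\mathrm{ER}}.
\end{align*}
```

Associativity is inherited from function composition, so under the stated assumption $\Phi^{\mathrm{ER}}$ would be a subgroup of $\Psi$.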

Figures (1)

  • Figure 1: Conditional action matrices of $\text{OP}^{\Phi^{\text{MDP}}}$-optimal and $\text{OP}^{\Phi^{\text{ER}}}$-optimal policies; i.e., $P(a_t^i \ | \ a_{t-1}^j)$. We select the agent from both respective populations achieving the highest cross-play scores. We can see the $\text{OP}^{\Phi^{\text{ER}}}$-optimal policy more consistently uses a rank hint to signal playing the fifth card, whereas the $\text{OP}^{\Phi^{\text{MDP}}}$-optimal policy uses a similar convention but less consistently.

Theorems & Definitions (14)

  • Definition 1: Dec-POMDP Symmetries
  • Definition 2: Other-Play (OP) Objective
  • Definition 3: Symmetry Breaking
  • Example 1
  • Definition 4: Expected Return Symmetries
  • Example 2
  • Theorem
  • proof
  • Theorem: Dec-POMDP Symmetry Expected Return Invariance
  • proof
  • ...and 4 more