Expected Return Symmetries
Darius Muglich, Johannes Forkel, Elise van der Pol, Jakob Foerster
TL;DR
The paper tackles zero-shot coordination in Dec-POMDPs by addressing coordination failures that arise when independently trained agents break symmetries in mutually incompatible ways. It introduces expected return symmetries ($\Phi^{\mathrm{ER}}$), a broad group of policy-space transformations that preserve self-play optimality for Boltzmann-exploratory policies, and shows that this group contains environment symmetries ($\Phi^{\mathrm{MDP}}$) as a subgroup. A model-free, gradient-based method learns a compact, diverse subset of ER symmetries to plug into the Other-Play (OP) objective, improving cross-play performance without ground-truth symmetry information. Across Iterated Three-Lever, Cat/Dog, Overcooked V2, and Hanabi, OP with ER symmetries significantly mitigates over-coordination and improves zero-shot coordination compared to both self-play and Dec-POMDP-symmetry baselines. The work highlights practical, scalable symmetry discovery via environment interaction, while noting two limitations: restricted expressivity (the learned symmetries are bijections on actions and observations) and dependence on the policy pool chosen for ER learning.
Abstract
Symmetry is an important inductive bias that can improve model robustness and generalization across many deep learning domains. In multi-agent settings, a priori known symmetries have been shown to address a fundamental coordination failure mode known as mutually incompatible symmetry breaking. Consider, for example, a game in which two independent agents each choose to move "left" or "right", and in which a reward of +1 is received when the agents choose the same action and -1 when they choose different actions. However, the efficient and automatic discovery of environment symmetries, in particular for decentralized partially observable Markov decision processes, remains an open problem. Furthermore, environmental symmetry breaking constitutes only one type of coordination failure, which motivates the search for a more accessible and broader symmetry class. In this paper, we introduce such a broader group of previously unexplored symmetries, called expected return symmetries, which contains environment symmetries as a subgroup. We show that agents trained to be compatible under the group of expected return symmetries achieve better zero-shot coordination results than those using environment symmetries. As an additional benefit, our method makes minimal a priori assumptions about the structure of the environment and does not require access to ground truth symmetries.
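The left/right game described above can be made concrete with a small numerical sketch. It is only an illustration, not the paper's method: the payoff matrix encodes the +1/-1 rewards from the abstract, while the names `PAYOFF`, `expected_return`, and `swap_actions` are hypothetical helpers introduced here. The sketch shows that two policies which are each optimal in self-play can fail in cross-play, and that relabeling the actions of both agents leaves the expected return unchanged.

```python
import numpy as np

# Payoff matrix for the two-action game from the abstract (illustrative encoding):
# rows index agent 1's action, columns index agent 2's action (0 = left, 1 = right).
PAYOFF = np.array([[1.0, -1.0],
                   [-1.0, 1.0]])


def expected_return(pi1, pi2):
    """Expected reward when the agents sample actions independently from pi1 and pi2."""
    return float(pi1 @ PAYOFF @ pi2)


def swap_actions(pi):
    """Relabel left <-> right (hypothetical helper; an assumed symmetry of this toy game)."""
    return pi[::-1]


always_left = np.array([1.0, 0.0])
always_right = np.array([0.0, 1.0])

# Each convention is optimal when an agent is paired with a copy of itself.
self_play_left = expected_return(always_left, always_left)
self_play_right = expected_return(always_right, always_right)

# Pairing agents that settled on different conventions gives the worst outcome.
cross_play = expected_return(always_left, always_right)

# Applying the relabeling to both agents preserves the expected return.
relabeled = expected_return(swap_actions(always_left), swap_actions(always_left))

print(self_play_left, self_play_right, cross_play, relabeled)
```

Both conventions are equally good in self-play, so nothing in training tells an agent which one an unseen partner has adopted; this is the incompatibility that symmetry-aware training aims to remove.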
