Entropy is all you need for Inter-Seed Cross-Play in Hanabi

Johannes Forkel, Jakob Foerster

TL;DR

The paper investigates zero-shot coordination (ZSC) in Hanabi and shows that entropy-regularized independent PPO (IPPO) can dramatically improve inter-seed cross-play (XP), achieving a new state-of-the-art XP score. It demonstrates that a moderately higher entropy coefficient (e.g., $\alpha \approx 0.05$) and recurrent actor-critic architectures, paired with $\lambda_{\text{GAE}} \approx 0.9$, substantially reduce symmetry-breaking conventions across seeds. The authors analyze both toy games and full Hanabi, highlighting that while entropy regularization can align policies across seeds, there exist Dec-POMDPs where entropy alone cannot guarantee optimal symmetric strategies, which motivates continued development of dedicated ZSC algorithms. The results offer practical guidance for hyperparameter choices in cross-seed MARL experiments and underscore the continuing importance of ZSC research in complex cooperative tasks.

Abstract

We find that in Hanabi, one of the most complex and popular benchmarks for zero-shot coordination and ad-hoc teamplay, a standard implementation of independent PPO with a slightly higher entropy coefficient (0.05 instead of the typically used 0.01) achieves a new state-of-the-art in cross-play between different seeds, beating by a significant margin all previous specialized algorithms that were specifically designed for this setting. We provide an intuition for why sufficiently high entropy regularization ensures that different random seeds produce joint policies which are mutually compatible. We also empirically find that a high $\lambda_{\text{GAE}}$ around 0.9, and using RNNs instead of just feed-forward layers in the actor-critic architecture, strongly increase inter-seed cross-play. While these results demonstrate the dramatic effect that hyperparameters can have not just on self-play scores but also on cross-play scores, we also show that there are simple Dec-POMDPs in which standard policy gradient methods with increased entropy regularization are not able to achieve perfect inter-seed cross-play, thus demonstrating the continuing necessity for new algorithms for zero-shot coordination.
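
To make the two hyperparameters named in the abstract concrete, here is a minimal NumPy sketch of where an entropy coefficient and $\lambda_{\text{GAE}}$ typically enter a PPO-style update. It is not the authors' implementation: the function names, the discount `gamma=0.99`, and the single-episode, no-early-termination setup are assumptions made for illustration.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.9):
    """Generalized advantage estimation for one episode segment.

    gamma is an assumed discount; lam=0.9 mirrors the value from the abstract.
    values[t] is the critic estimate at step t, last_value the bootstrap value.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), last_value)
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then the exponentially weighted sum of TD errors.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def policy_entropy(probs):
    """Mean Shannon entropy (in nats) over a batch of action distributions."""
    probs = np.asarray(probs, dtype=float)
    safe = np.clip(probs, 1e-12, 1.0)  # avoid log(0); 0 * log(.) stays 0
    return float(np.mean(-np.sum(probs * np.log(safe), axis=-1)))

def entropy_regularized_loss(policy_loss, probs, ent_coef=0.05):
    """Loss to minimize: policy loss minus the entropy bonus.

    ent_coef=0.05 is the higher coefficient from the abstract; 0.01 is the
    value the abstract describes as typical.
    """
    return policy_loss - ent_coef * policy_entropy(probs)
```

Raising `ent_coef` makes the same amount of policy entropy reduce the loss more, so the optimizer is pushed harder toward stochastic policies during training.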

Paper Structure

This paper contains 17 sections, 8 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Consider the one-round simultaneous action game with payoff matrix given in (\ref{eq:payoff1}), and the parametrization $\theta = (\theta_1, \theta_2) \mapsto \text{softmax}(\theta_1, -\theta_1, \theta_2) =: \pi^1_\theta(\cdot) = \pi^2_\theta(\cdot)$. Bottom: the function $\theta \mapsto J^\alpha_{\text{SP}}(\pi_\theta)$, with $\alpha = 1.0$ on the left and $\alpha = 1.2$ on the right. Top left: $\pi_{\theta_\infty}$ and its return $J_{\text{SP}}(\pi_{\theta_\infty})$ as a function of the entropy coefficient that was used during training of $\pi_\theta$. Top right: XP matrix between multiple greedified joint policies, which were trained with independent REINFORCE, with different entropy coefficients. 5 seeds per entropy coefficient.
  • Figure 2: Toy cooperative communication game (figure taken from hu2021off, with kind permission of the authors): Alice observes the pet and can signal "on", signal "off", "reveal" the pet to Bob for a reward of $-3$, or "bail" out for a reward of $1$. Bob can then guess "cat" or "dog" for a reward of $\pm 10$ depending on whether he was correct, or he can "bail" out for a reward of $1$.
  • Figure 3: Left: XP matrix between policies in the cat/dog game, which are trained with entropy-regularized independent REINFORCE with baseline, using different entropy coefficients. During training the policies sample actions, and during XP they always take the action with the highest probability. Right: same as left, except that the reward for "reveal" is now $-8$ instead of $-3$.
  • Figure 4: 2-Player Hanabi: Block XP matrices between greedified policies trained with IPPO with different entropy coefficients and $\lambda_{\text{GAE}} = 0.9$. Four seeds per entropy coefficient. The scores in parentheses are the averages of the SP scores in the diagonal 4x4 blocks, and all other scores are the averages of the XP scores in the 4x4 blocks. Top: LSTM. Middle: PP LSTM. Bottom: FF. For the full XP matrices see Figures \ref{fig:XP_IPPO_LSTM_FULL}, \ref{fig:XP_IPPO_LSTM_PP_FULL}, and \ref{fig:XP_IPPO_FF_FULL} in the appendix.
  • Figure 5: Public-Private LSTM Actor-Critic Architecture: In Hanabi, a player's public observation includes everything except for the players' hands. A player's private observation is the public observation plus the other players' hands. The state is the public observation plus all players' hands. For IPPO one doesn't need a separate MLP for the critic, as the critic conditions only on the local action-observation history (AOH) $\tau_t^i$, just like the actor. For IPPO, the critic MLP just receives the private observation $o_t^{i, \text{private}}$ as well. Both MLP streams have 3 hidden layers, and the LSTM stream has one feedforward embedding layer and two LSTM layers.
  • ...and 3 more figures
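
The block XP matrices described in the Figure 4 caption can be derived from a full policy-by-policy score matrix. The following NumPy sketch shows one way to do that; it is an illustration under stated assumptions, not the authors' evaluation code. The function name `block_xp_summary` is hypothetical, policies are assumed to be ordered so that consecutive seeds share an entropy coefficient, and the choice to exclude the SP diagonal from the within-group XP average is an assumption rather than something the caption states.

```python
import numpy as np

def block_xp_summary(xp, seeds_per_group):
    """Summarize a full XP matrix into per-entropy-coefficient blocks.

    xp[i, j] is the score when policy i is paired with policy j. Each run of
    `seeds_per_group` consecutive indices is assumed to share one coefficient.
    Returns (sp, xp_mean): sp[a] is the mean self-play score of group a, and
    xp_mean[a, b] is the mean cross-play score between groups a and b.
    """
    xp = np.asarray(xp, dtype=float)
    n = xp.shape[0]
    k = seeds_per_group
    assert xp.shape == (n, n) and k >= 2 and n % k == 0
    g = n // k
    # blocks[a, p, b, q] == xp[a * k + p, b * k + q]
    blocks = xp.reshape(g, k, g, k)
    xp_mean = blocks.mean(axis=(1, 3))
    sp = np.empty(g)
    for a in range(g):
        diag_block = blocks[a, :, a, :]
        trace = np.trace(diag_block)
        # SP: a policy paired with itself (diagonal of the diagonal block).
        sp[a] = trace / k
        # Within-group XP: different seeds, same coefficient (assumed to
        # exclude the SP entries on the diagonal).
        xp_mean[a, a] = (diag_block.sum() - trace) / (k * (k - 1))
    return sp, xp_mean
```

With four seeds per coefficient, as in Figure 4, one would call `block_xp_summary(full_matrix, 4)`; `sp` then corresponds to the parenthesized scores and `xp_mean` to the remaining entries.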

Theorems & Definitions (3)

  • Definition 2.1
  • Remark 2.2
  • Definition 2.3: Cross-Play (XP)