Entropy is all you need for Inter-Seed Cross-Play in Hanabi
Johannes Forkel, Jakob Foerster
TL;DR
The paper investigates zero-shot coordination in Hanabi and shows that independent PPO (IPPO) with simple entropy regularization dramatically improves cross-play between seeds, achieving a new state-of-the-art cross-play (XP) score. It demonstrates that a moderately higher entropy coefficient (e.g., $\alpha \approx 0.05$) and recurrent actor-critic architectures, paired with $\lambda_{\text{GAE}} \approx 0.9$, substantially reduce symmetry-breaking conventions across seeds. The authors provide both toy and full Hanabi analyses, highlighting that while entropy can align inter-seed policies, there exist Dec-POMDPs where entropy alone cannot guarantee optimal symmetric strategies, which motivates further development of dedicated ZSC algorithms. The results offer practical guidance for hyperparameter choices in cross-seed MARL experiments and underscore the continued importance of zero-shot coordination research in complex cooperative tasks.
Abstract
We find that in Hanabi, one of the most complex and popular benchmarks for zero-shot coordination and ad-hoc teamplay, a standard implementation of independent PPO with a slightly higher entropy coefficient of 0.05, instead of the typically used 0.01, achieves a new state of the art in cross-play between different seeds. It beats by a significant margin all previous specialized algorithms that were designed specifically for this setting. We provide an intuition for why sufficiently high entropy regularization ensures that different random seeds produce joint policies which are mutually compatible. We also find empirically that a high $\lambda_{\text{GAE}}$ of around 0.9, and the use of RNNs instead of only feed-forward layers in the actor-critic architecture, strongly increase inter-seed cross-play. While these results demonstrate the dramatic effect that hyperparameters can have not just on self-play scores but also on cross-play scores, we also show that there are simple Dec-POMDPs in which standard policy gradient methods with increased entropy regularization are not able to achieve perfect inter-seed cross-play, thus demonstrating the continuing necessity for new algorithms for zero-shot coordination.
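To make the two hyperparameters named above concrete, the following is a minimal NumPy sketch of where the entropy coefficient and $\lambda_{\text{GAE}}$ enter a generic PPO update. It is not the authors' implementation. The function names (`gae_advantages`, `categorical_entropy`, `ppo_actor_loss`), the discount `gamma=0.99`, and the clipping range `clip_eps=0.2` are illustrative assumptions; only the values 0.05 (entropy coefficient) and 0.9 ($\lambda_{\text{GAE}}$) come from the text.

```python
import numpy as np


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.9):
    """Generalized advantage estimation over one rollout segment.

    Assumes no episode terminates inside the segment; `last_value` is the
    critic's bootstrap estimate for the state after the final step.
    `lam=0.9` mirrors the value reported in the text; `gamma` is an assumption.
    """
    T = len(rewards)
    adv = np.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv


def categorical_entropy(probs):
    """Shannon entropy of each row of action probabilities."""
    p = np.clip(probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)


def ppo_actor_loss(new_probs, old_probs, actions, advantages,
                   clip_eps=0.2, ent_coef=0.05):
    """Clipped PPO policy loss minus an entropy bonus.

    `ent_coef=0.05` is the coefficient discussed in the text (versus the
    commonly used 0.01); `clip_eps` is an assumed, conventional value.
    Returns the scalar loss and the mean policy entropy.
    """
    idx = np.arange(len(actions))
    ratio = new_probs[idx, actions] / old_probs[idx, actions]
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -np.mean(np.minimum(unclipped, clipped))
    entropy = np.mean(categorical_entropy(new_probs))
    # A larger ent_coef rewards more stochastic policies more strongly.
    return policy_loss - ent_coef * entropy, entropy
```

In this sketch, raising `ent_coef` from 0.01 to 0.05 only changes the weight on the entropy term, and `lam` controls how far the advantage estimate looks beyond the one-step TD error. In independent PPO, each agent would minimize such a loss on its own trajectories.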
