A Generalist Hanabi Agent
Arjun V Sudhakar, Hadi Nekoei, Mathieu Reymond, Miao Liu, Janarthanan Rajendran, Sarath Chandar
TL;DR
This work tackles the limited generalization of MARL agents in Hanabi across varying partner configurations by introducing R3D2, a generalist Hanabi agent that frames the game as a text-based environment. It combines a language-informed observation and action representation with a DRRN-inspired architecture to handle dynamic state/action spaces, trained via distributed self-play across 2–5 players. The key contributions include a player-agnostic, variable-player learning framework, strong zero-shot coordination across novel settings and partners, and empirical evidence that text-based transfer plus dynamic action handling yields robust coordination, even when paired with diverse algorithms. While LLMs alone do not yet solve Hanabi, R3D2 demonstrates that language-based transfer and multi-setting training enable practical generalization and collaboration in complex cooperative games.
Abstract
Traditional multi-agent reinforcement learning (MARL) systems can develop cooperative strategies through repeated interactions. However, these systems perform poorly in any setting other than the one they were trained on, and they struggle to cooperate with unfamiliar collaborators. This is particularly visible in the Hanabi benchmark, a popular 2-to-5 player cooperative card game that requires complex reasoning and precise assistance to other agents. Current MARL agents for Hanabi can only learn one specific game setting (e.g., 2-player games) and can only play with agents trained by the same algorithm. This is in stark contrast to humans, who can quickly adjust their strategies to work with unfamiliar partners or situations. In this paper, we introduce Recurrent Replay Relevance Distributed DQN (R3D2), a generalist agent for Hanabi designed to overcome these limitations. We reformulate the task using text, as language has been shown to improve transfer. We then propose a distributed MARL algorithm that copes with the resulting dynamic observation and action spaces. In doing so, our agent is the first that can play all game settings concurrently and extend strategies learned in one setting to others. As a consequence, our agent also demonstrates the ability to collaborate with different algorithmic agents, agents that are themselves unable to do so. The implementation code is available at: https://github.com/chandar-lab/R3D2-A-Generalist-Hanabi-Agent
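To make the idea of a dynamic, text-based action space concrete, the following is a minimal illustrative sketch and not the R3D2 implementation. It assumes a DRRN-style scorer in which the observation and each candidate action are embedded separately and combined with a dot product, so the same function handles action sets of any size. The hashed bag-of-words `embed` function is a stand-in for a learned text encoder, and the names `candidate_actions`, `q_values`, and `greedy_action`, as well as the action phrasing, are hypothetical.

```python
import hashlib
import math

DIM = 64  # hypothetical embedding size, chosen only for illustration


def embed(text, dim=DIM):
    """Toy hashed bag-of-words embedding (stand-in for a learned text encoder)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[idx] += sign
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid division by zero
    return [v / norm for v in vec]


def candidate_actions(num_players, hand_size):
    """Hypothetical text rendering of Hanabi moves; the set grows with player count."""
    actions = [f"play card {i}" for i in range(hand_size)]
    actions += [f"discard card {i}" for i in range(hand_size)]
    for p in range(1, num_players):  # p is the offset to another player
        for color in ("red", "yellow", "green", "white", "blue"):
            actions.append(f"hint color {color} to player {p}")
        for rank in range(1, 6):
            actions.append(f"hint rank {rank} to player {p}")
    return actions


def q_values(observation, actions):
    """Score each (observation, action) pair with a dot product, DRRN-style."""
    obs_vec = embed(observation)
    return [sum(o * a for o, a in zip(obs_vec, embed(act))) for act in actions]


def greedy_action(observation, actions):
    """Pick the highest-scoring action from a variable-size action list."""
    scores = q_values(observation, actions)
    best = max(range(len(actions)), key=scores.__getitem__)
    return actions[best]


if __name__ == "__main__":
    obs = "fireworks red 1 ; player 1 holds blue 2 ; hint tokens 8"
    # Standard Hanabi hand sizes: 5 cards with 2-3 players, 4 cards with 4-5 players.
    for n, hand in ((2, 5), (5, 4)):
        acts = candidate_actions(n, hand)
        print(n, "players:", len(acts), "candidate actions ->", greedy_action(obs, acts))
```

Because the scorer never assumes a fixed number of output units, the same untrained toy model accepts both the 2-player and the 5-player action lists; a fixed-size Q-network head would instead need one output layer per game setting.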
