A Generalist Hanabi Agent

Arjun V Sudhakar, Hadi Nekoei, Mathieu Reymond, Miao Liu, Janarthanan Rajendran, Sarath Chandar

TL;DR

This work tackles the limited generalization of MARL agents in Hanabi across varying partner configurations by introducing R3D2, a generalist Hanabi agent that frames the game as a text-based environment. It combines a language-informed observation and action representation with a DRRN-inspired architecture to handle dynamic state/action spaces, trained via distributed self-play across 2–5 players. The key contributions include a player-agnostic, variable-player learning framework, strong zero-shot coordination across novel settings and partners, and empirical evidence that text-based transfer plus dynamic action handling yields robust coordination, even when paired with diverse algorithms. While LLMs alone do not yet solve Hanabi, R3D2 demonstrates that language-based transfer and multi-setting training enable practical generalization and collaboration in complex cooperative games.

Abstract

Traditional multi-agent reinforcement learning (MARL) systems can develop cooperative strategies through repeated interactions. However, these systems perform poorly on any setting other than the one they were trained on, and struggle to cooperate with unfamiliar collaborators. This is particularly visible in the Hanabi benchmark, a popular 2-to-5 player cooperative card game which requires complex reasoning and precise assistance to other agents. Current MARL agents for Hanabi can only learn one specific game setting (e.g., 2-player games), and can only play with agents trained by the same algorithm. This is in stark contrast to humans, who can quickly adjust their strategies to work with unfamiliar partners or situations. In this paper, we introduce Recurrent Replay Relevance Distributed DQN (R3D2), a generalist agent for Hanabi, designed to overcome these limitations. We reformulate the task using text, as language has been shown to improve transfer. We then propose a distributed MARL algorithm that copes with the resulting dynamic observation and action spaces. In doing so, our agent is the first that can play all game settings concurrently, and extend strategies learned from one setting to other ones. As a consequence, our agent also demonstrates the ability to collaborate with different algorithmic agents, which are themselves unable to do so. The implementation code is available at: https://github.com/chandar-lab/R3D2-A-Generalist-Hanabi-Agent

Paper Structure

This paper contains 34 sections, 3 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: The template used for textual observations in Hanabi. It includes all necessary information to play the game, including life and clue tokens, visible hands, discarded cards, and hints.
  • Figure 2: An overview of the R3D2 architecture. R3D2 uses a separate head for observations and actions. Each head starts with 2 TinyBERT layers to encode the textual representation, followed by a LSTM layer to encode the previous timesteps. We use elementwise multiplication to combine both embeddings. This is then split into a separate value and advantage head, before being summed together to obtain $Q_\theta(\tau, a)$.
  • Figure 3: Policy transfer in the zero-shot setting. Each subplot shows the evaluation setting for an $n$-player game. Each bar combines $0 < i < n$ agents trained on a different setting with $n-i$ players trained on $n$-player games. R3D2 agents demonstrate strong zero-shot generalization to novel settings. Moreover, R2D2-text seems unable to match R3D2's transfer performance, especially when transferring from a setting with a large number of actions to one with a smaller action space.
  • Figure 4: Self-play (SP), intra-XP and inter-XP performance in the 2-player setting, averaged across three independent seeds per method. R3D2 achieves significantly better inter-XP than the baselines while maintaining competitive SP and intra-XP.
  • Figure 5: 2-player zero-shot coordination matrix between different methods, with three independent seeds per method. R3D2 achieves better inter-XP with IQL, R2D2-OBL and R2D2-OP.
  • ...and 9 more figures
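
The Q-value combination described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the TinyBERT and LSTM encoders are replaced by pre-computed embedding vectors, the value and advantage heads are assumed to be single linear layers, and every name here (`q_values`, `greedy_action`, `w_value`, `w_adv`) is hypothetical. What it shows is that each candidate action is scored separately, so the number of legal actions may differ between game settings.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def q_values(
    obs_emb: Sequence[float],
    action_embs: Sequence[Sequence[float]],
    w_value: Sequence[float],
    w_adv: Sequence[float],
) -> List[float]:
    """Score every candidate action against one observation embedding.

    Assumption: both heads are plain linear maps (hypothetical weights);
    the real encoders are stood in for by pre-computed vectors.
    """
    scores = []
    for act_emb in action_embs:
        # Combine observation and action embeddings by elementwise multiplication.
        joint = [o * a for o, a in zip(obs_emb, act_emb)]
        value = dot(w_value, joint)    # value head
        advantage = dot(w_adv, joint)  # advantage head
        scores.append(value + advantage)  # summed to give Q(tau, a)
    return scores


def greedy_action(scores: Sequence[float]) -> int:
    """Index of the highest-scoring action."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy example with three candidate actions; the list could have any length.
obs = [1.0, 2.0]
actions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
qs = q_values(obs, actions, w_value=[0.5, 0.5], w_adv=[1.0, -1.0])
print(qs, greedy_action(qs))  # [1.5, -1.0, 0.5] 0
```

Because the action enters as an input embedding rather than as a fixed output index, the same weights can score two actions or twenty, which is the property a variable-player agent needs.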