Augmenting the action space with conventions to improve multi-agent cooperation in Hanabi

F. Bredell, H. A. Engelbrecht, J. C. Schoeman

TL;DR

This work tackles cooperative multi-agent reinforcement learning in Hanabi under partial observability and limited communication. It introduces artificial conventions as cooperative actions that span multiple time steps and agents, integrated via action-space augmentation and a subscribing mechanism for continuation. Empirical results show that Rainbow agents equipped with conventions train significantly faster and achieve higher or comparable scores in self-play and cross-play across 2–5 players, with particularly strong gains in larger groups. The findings suggest conventions as a scalable approach to implicit coordination and inspire future work on convention discovery and richer multi-step conventions for MARL.

Abstract

The card game Hanabi is considered a strong medium for the testing and development of multi-agent reinforcement learning (MARL) algorithms, due to its cooperative nature, partial observability, limited communication and remarkable complexity. Previous research efforts have explored the capabilities of MARL algorithms within Hanabi, focusing largely on advanced architecture design and algorithmic manipulations to achieve state-of-the-art performance for various numbers of cooperators. However, this often leads to complex solution strategies that have a high computational cost and require large amounts of training data. To play Hanabi effectively, humans rely on conventions, which often provide a means to implicitly convey ideas or knowledge based on a predefined, mutually agreed-upon set of "rules" or principles. Multi-agent problems with partial observability, especially those with limited communication, can benefit greatly from implicit knowledge sharing. In this paper, we propose a novel approach to augmenting an agent's action space using conventions, which act as sequences of special cooperative actions that span multiple time steps and multiple agents, and which require agents to actively opt in for them to reach fruition. These conventions are based on existing human conventions, and result in a significant improvement in the performance of existing techniques in self-play and cross-play for various numbers of cooperators within Hanabi.

Paper Structure

This paper contains 25 sections, 10 equations, 5 figures, 8 tables, 2 algorithms.

Figures (5)

  • Figure 1: Architecture design for a feedforward neural network (NN) with an augmented action-convention space applied. The environment provides the observation $O_t$ and receives the environment action $A_t$. The size of the input layer is equal to the size of the observation tuple ($\rho$), and each hidden layer has a size equal to the number of atoms for that layer. The size of the output layer is equal to the size of the augmented action-convention space ($|C|$), and the action selection is determined by the MARL algorithm. Finally, the chosen augmented action-convention determines the specific convention $c_k$ and its step $m$, which in turn are used to produce the environment action $A_t$ according to the policy $\pi_k^m$ and the observation $O_t$.
  • Figure 2: An example game of Hanabi as seen from the perspective of player 1. There are four hint tokens left, and the players have lost one of their shared life tokens. Player 1 knows about two 4s in their hand and the green, blue and yellow stacks have been partially completed, leading to a current game score of 6/25. It is now player 1's turn, and they can take a hint action (to player 2 or 3), a play action (from their hand) or a discard action (from their hand).
  • Figure 3: (a) Learning curves for independent Deep Q-learning (DQN) with a primitive-action space, pure conventions space, and an augmented and simplified action-convention space tested in our in-house Small Hanabi environment. (b) Learning curves for Rainbow with a primitive-action space, pure conventions space, and an augmented and simplified action-convention space tested in DeepMind's Hanabi learning environment with the Small Hanabi preset (Bard et al.).
  • Figure 4: Learning curves with exponential moving averages (weight = 0.9995) for Rainbow, obtained from Bard et al., as baseline compared to Rainbow with an augmented action-convention space for two- to five-player Hanabi. The best agent is highlighted in black for each agent scenario within each player count.
  • Figure 5: Distribution of scores for 2–5 player Hanabi over the course of 1000 evaluation episodes comparing baseline Rainbow and Rainbow with an augmented action-convention space. The baseline Rainbow results were obtained from Bard et al.
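
The decoding step described in the Figure 1 caption can be illustrated with a small Python sketch. It is not the authors' implementation: the layout of the augmented space (primitive actions followed by one entry per convention step), the action strings, the observation keys and the `hint_then_play` convention are all hypothetical, chosen only to show how a selected output index could be mapped to a convention $c_k$, a step $m$, and then to an environment action $A_t$ through a step policy $\pi_k^m$.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: observations are plain dicts and environment actions are strings.
Observation = Dict[str, object]
# A step policy plays the role of pi_k^m: it maps O_t to A_t,
# or returns None when the convention step does not apply.
StepPolicy = Callable[[Observation], Optional[str]]
Entry = Tuple[str, str, int]  # (kind, name, step m)


def build_augmented_space(
    primitive_actions: List[str],
    conventions: Dict[str, List[StepPolicy]],
) -> List[Entry]:
    """List every output unit: primitives first, then one entry per convention step."""
    space: List[Entry] = [("primitive", a, 0) for a in primitive_actions]
    for name, steps in conventions.items():
        for m in range(len(steps)):
            space.append(("convention", name, m))
    return space


def decode(
    index: int,
    space: List[Entry],
    conventions: Dict[str, List[StepPolicy]],
    obs: Observation,
) -> Optional[str]:
    """Turn the index chosen by the MARL algorithm into an environment action."""
    kind, name, m = space[index]
    if kind == "primitive":
        return name
    # Convention entry: apply the step policy pi_k^m to the observation.
    return conventions[name][m](obs)


# Hypothetical two-step convention: one agent hints, the next agent opts in by playing.
def hint_step(obs: Observation) -> Optional[str]:
    slot = obs.get("partner_playable_slot")
    return None if slot is None else f"hint_slot_{slot}"


def play_step(obs: Observation) -> Optional[str]:
    slot = obs.get("hinted_slot")
    return None if slot is None else f"play_slot_{slot}"


CONVENTIONS = {"hint_then_play": [hint_step, play_step]}
SPACE = build_augmented_space(["discard_0", "play_0"], CONVENTIONS)

if __name__ == "__main__":
    # The output layer would have len(SPACE) units in this toy setting.
    print(len(SPACE))
    print(decode(2, SPACE, CONVENTIONS, {"partner_playable_slot": 2}))
    print(decode(3, SPACE, CONVENTIONS, {"hinted_slot": 1}))
```

In this sketch a step policy returns `None` when its convention step is not applicable to the current observation; how such cases are handled in practice (for example by masking the corresponding outputs) is left open here.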