The Yokai Learning Environment: Tracking Beliefs Over Space and Time

Constantin Ruhdorfer, Matteo Bortoletto, Johannes Forkel, Jakob Foerster, Andreas Bulling

TL;DR

The Yokai Learning Environment (YLE) is an open-source multi-agent RL benchmark in which effective collaboration requires building common ground: agents must track and update beliefs over moving cards, reason under ambiguous hints, and decide when to terminate the game based on inferred shared knowledge. These features are absent from the Hanabi Learning Environment (HLE).

Abstract

The ability to cooperate with unknown partners is a central challenge in cooperative AI. It is widely studied in the form of zero-shot coordination (ZSC), which evaluates an algorithm by measuring the performance of independently trained agents when they are paired. The Hanabi Learning Environment (HLE) has become the dominant benchmark for ZSC, but recent work has achieved near-perfect inter-seed cross-play performance, limiting its ability to track algorithmic progress. We introduce the Yokai Learning Environment (YLE), an open-source multi-agent RL benchmark in which effective collaboration requires building common ground by tracking and updating beliefs over moving cards, reasoning under ambiguous hints, and deciding when to terminate the game based on inferred shared knowledge. These features are absent in the HLE, where beliefs are tied to hand slots and hints are truthful by rule. We evaluate the leading ZSC methods, including High-Entropy IPPO, Other-Play, and Off-Belief Learning, which achieve near-perfect inter-seed cross-play in the HLE. In the YLE, they exhibit persistent gaps between self-play (SP) and cross-play (XP) scores, degraded early-ending calibration, and weaker belief representations in cross-play, indicating a failure to maintain consistent internal models with unseen partners. Methods that perform best in the HLE do not perform best in the YLE, indicating that progress measured on a single benchmark may not generalise. Together, these results establish the YLE as a challenging new ZSC benchmark.
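
The SP-XP gap mentioned in the abstract can be illustrated with a minimal sketch. It assumes a square score matrix in which entry `[i][j]` is the mean score obtained when the agent trained with seed `i` is paired with the agent trained with seed `j`; the function name `sp_xp_gap`, the matrix layout, and the example numbers are hypothetical and are not taken from the paper or the YLE code base.

```python
from statistics import mean


def sp_xp_gap(scores):
    """Return (self-play mean, cross-play mean, gap) for a seed-by-seed score matrix.

    scores[i][j] is assumed to hold the mean score when the agent from
    seed i plays with the agent from seed j (hypothetical layout).
    """
    n = len(scores)
    if n < 2 or any(len(row) != n for row in scores):
        raise ValueError("need a square matrix with at least two seeds")
    # Self-play: each agent paired with a copy of itself (the diagonal).
    sp = mean(scores[i][i] for i in range(n))
    # Cross-play: agents from different seeds paired together (off-diagonal).
    xp = mean(scores[i][j] for i in range(n) for j in range(n) if i != j)
    return sp, xp, sp - xp


# Made-up numbers, for illustration only.
example = [
    [0.9, 0.5, 0.4],
    [0.6, 0.8, 0.5],
    [0.4, 0.5, 1.0],
]
sp, xp, gap = sp_xp_gap(example)
print(f"SP={sp:.3f} XP={xp:.3f} gap={gap:.3f}")
```

A large positive gap means that agents score well with their own training partner but poorly with independently trained partners, which is the failure mode the abstract describes.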

Paper Structure

This paper contains 54 sections, 9 equations, 27 figures, 8 tables.

Figures (27)

  • Figure 1: The YLE poses a challenging theory-of-mind (ToM) reasoning task for ZSC. Agents cannot observe all cards in a single game, so successful play requires them to reason about other agents' knowledge and beliefs. In the example, the acting agent privately observes the colours of two cards (1 and 2). Their partner recalls from earlier that card 1 is blue. When the acting agent moves card 1 next to card 2, the partner can infer that card 2 is blue as well. Both agents can use this common ground going forward.
  • Figure 2: A sample round in nine-card YLE. The first agent privately observes two cards (one blue, one green), moves the blue card, and finishes the turn by either (4a) revealing or (4b) placing a hint; here, they place a hint. The second agent can then start their turn and chooses to end the game. Ending the game returns the final reward based on the outcome. This game is lost because the green and red cards are not clustered.
  • Figure 3: Legal moves for the card with the blue dot. Cards can only be moved so that all cards remain connected via their sides. Cards with an orange dot cannot currently be moved.
  • Figure 4: A second-order ToM reasoning example over multiple timesteps in 2-player YLE. One agent believes that their partner believes that they (the first agent) knew that cards 1 and 2 are of the same colour. Even though one agent never saw card 1 and the other never observed card 3, both now know where all blue cards are, that they are grouped, and that this knowledge is part of their common ground. They can now potentially finish early, as there is one card fewer that needs to be observed.
  • Figure 5: Probing card colours from the hidden state in self-play and cross-play over timesteps.
  • ...and 22 more figures
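
The movement rule in the Figure 3 caption (cards can only be moved so that all cards remain connected via their sides) can be sketched as a connectivity check on a grid. This is a minimal illustration under stated assumptions, not the YLE implementation: card positions are modelled as integer `(x, y)` cells, the helper names `is_connected` and `legal_destinations` are hypothetical, and any further constraints the environment may impose on movement are ignored.

```python
from collections import deque

# Side-adjacent offsets; diagonal contact does not count as connected.
NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def is_connected(cells):
    """Return True if all cells form one side-connected group (BFS flood fill)."""
    cells = set(cells)
    if not cells:
        return True
    start = next(iter(cells))
    seen = {start}
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        for dx, dy in NEIGHBOURS:
            nxt = (x + dx, y + dy)
            if nxt in cells and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(cells)


def legal_destinations(cells, card):
    """Return the empty cells that `card` could move to while keeping the layout connected."""
    cells = set(cells)
    rest = cells - {card}
    # Candidate targets: empty cells that touch a remaining card by a side.
    # Subtracting `cells` also excludes the card's current position.
    candidates = {(x + dx, y + dy) for x, y in rest for dx, dy in NEIGHBOURS} - cells
    return {dst for dst in candidates if is_connected(rest | {dst})}


# Three cards in a row: the middle card is stuck, the end card is free to move.
row = {(0, 0), (1, 0), (2, 0)}
print(legal_destinations(row, (1, 0)))
print(sorted(legal_destinations(row, (0, 0))))
```

Under this model, a card has no legal destination when removing it splits the layout into pieces that no single new position can rejoin, which is one possible reading of why some cards in Figure 3 cannot currently be moved.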

Theorems & Definitions (1)

  • Definition 3.1: Dec-POMDP Oliehoek2016