OvercookedV2: Rethinking Overcooked for Zero-Shot Coordination

Tobias Gessler, Tin Dizdarevic, Ani Calinescu, Benjamin Ellis, Andrei Lupu, Jakob Nicolaus Foerster

TL;DR

The paper investigates zero-shot coordination (ZSC) in multi-agent systems, arguing that cross-play failures in the original Overcooked benchmark stem largely from poor state coverage under self-play rather than from more sophisticated coordination challenges. It introduces state augmentation to broaden training exposure and finds that cross-play gaps largely vanish when agents are trained on diverse states. To push beyond state coverage, it then presents OvercookedV2, a successor environment with partial observability, asymmetric information, stochastic recipes, grounded communication, and test-time protocol formation scenarios. Experiments show that even with architectural improvements and baseline ZSC methods, cross-play remains difficult in OvercookedV2, highlighting the need for online-adaptive coordination algorithms and broader, more diverse evaluation benchmarks. Overall, OvercookedV2 provides a more rigorous platform to benchmark ZSC and drive development of AI systems that can coordinate with humans and other agents in dynamic, partially observable settings.

Abstract

AI agents hold the potential to transform everyday life by helping humans achieve their goals. To do this successfully, agents need to be able to coordinate with novel partners without prior interaction, a setting known as zero-shot coordination (ZSC). Overcooked has become one of the most popular benchmarks for evaluating coordination capabilities of AI agents and learning algorithms. In this work, we investigate the origins of ZSC challenges in Overcooked. We introduce a state augmentation mechanism which mixes states that might be encountered when paired with unknown partners into the training distribution, reducing the out-of-distribution challenge associated with ZSC. We show that independently trained agents under this algorithm coordinate successfully in Overcooked. Our results suggest that ZSC failure can largely be attributed to poor state coverage under self-play rather than more sophisticated coordination challenges. The Overcooked environment is therefore not suitable as a ZSC benchmark. To address these shortcomings, we introduce OvercookedV2, a new version of the benchmark, which includes asymmetric information and stochasticity, facilitating the creation of interesting ZSC scenarios. To validate OvercookedV2, we conduct experiments demonstrating that mere exhaustive state coverage is insufficient to coordinate well. Finally, we use OvercookedV2 to build a new range of coordination challenges, including ones that require test time protocol formation, and we demonstrate the need for new coordination algorithms that can adapt online. We hope that OvercookedV2 will help benchmark the next generation of ZSC algorithms and advance collaboration between AI agents and humans.

Paper Structure

This paper contains 41 sections, 21 figures, 11 tables, 1 algorithm.

Figures (21)

  • Figure 1: Overview of an OvercookedV2 layout with multiple ingredients, a dynamic recipe indicator, and an agent view radius of one cell.
  • Figure 2: Button Game. Cooperative guessing game with many equivalent communication actions. Alice (left) observes a pet, either a cat or a dog, and Bob (right) must guess the pet. Alice presses one of $N$ buttons, which activates one of $2N$ light bulbs. Bob observes the bulb, whose parity encodes the pet's identity, before guessing the pet. The game does not require coordination, as Bob can always guess correctly from the bulb's parity. Correct/incorrect guesses are rewarded with +/-10 points.
  • Figure 3: Cross-play matrix for 10 independent SP agents and a best response (BR) to a uniform random agent (U). The game admits a globally optimal strategy for Bob. SP agents succeed in self-play but fail to coordinate with novel partners since they overfit to specific buttons and fail to generalise. The BR agent achieves perfect scores in all pairings. This demonstrates that apparent coordination issues can in some cases be entirely explained by lack of state coverage.
  • Figure 4: Cross-play matrix for the standard and state-augmented settings in the Counter Circuit layout. Ten agents were independently trained for each setting; each cell ($i$, $j$) of the matrix represents the average score across 500 episodes played by the $i$th and $j$th agents. Agents trained in the standard setting achieve high SP scores (diagonal), but most XP pairings perform poorly or fail completely. In the state-augmented setting, SP scores are slightly higher than in the standard setting, and XP scores are dramatically improved, with no total failures.
  • Figure 5: Handcrafted coordination challenges. We provide three classes of layouts, each with a simple (top row) and a more complex (bottom row) configuration. From left to right: Grounded Coordination Simple/Ring layouts present a temporally extended version of the toy Cat-Dog coordination problem introduced by Hu et al. (2021). Test-Time Protocol Formation Simple/Wide layouts require agents to form protocols at test time, based on feedback they receive after delivery. Demo Cook Simple/Wide layouts require agents to rely on the meaning of the other agents' actions.
  • ...and 16 more figures
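
The Button Game and cross-play matrix described in the captions for Figures 2 and 3 can be illustrated with a small toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the wiring `bulb = 2 * button + pet`, the value of `N_BUTTONS`, and the helpers `make_sp_agent`, `make_br_agent`, `play`, and `cross_play_matrix` are all hypothetical, and the "self-play agents" are hand-built stand-ins (Alice commits to one button, Bob is only reliable on the bulbs that button can light) rather than trained policies.

```python
import random

N_BUTTONS = 8      # N in the caption; the value 8 is an arbitrary assumption
PETS = (0, 1)      # 0 = cat, 1 = dog (this labelling is an assumption)
REWARD = 10        # +10 for a correct guess, -10 for an incorrect one


def bulb_for(button, pet):
    # Assumed wiring: button b lights bulb 2*b + pet, so parity reveals the pet.
    return 2 * button + pet


def make_sp_agent(rng):
    """Toy stand-in for a self-play agent that overfits to a single button."""
    button = rng.randrange(N_BUTTONS)
    # Bob's guesses on bulbs never seen during "training" are arbitrary.
    bob_table = [rng.choice(PETS) for _ in range(2 * N_BUTTONS)]
    for pet in PETS:
        bob_table[bulb_for(button, pet)] = pet  # correct on the seen bulbs
    return (lambda pet: button), (lambda bulb: bob_table[bulb])


def make_br_agent(rng):
    """Best response to a uniformly random Alice: Bob reads the bulb's parity."""
    return (lambda pet: rng.randrange(N_BUTTONS)), (lambda bulb: bulb % 2)


def play(alice, bob, rng, episodes=200):
    """Average reward when this Alice is paired with this Bob."""
    total = 0
    for _ in range(episodes):
        pet = rng.choice(PETS)
        guess = bob(bulb_for(alice(pet), pet))
        total += REWARD if guess == pet else -REWARD
    return total / episodes


def cross_play_matrix(agents, rng):
    # Cell (i, j): Alice taken from agent i, Bob taken from agent j.
    return [[play(a_i, b_j, rng) for (_, b_j) in agents] for (a_i, _) in agents]


rng = random.Random(0)
agents = [make_sp_agent(rng) for _ in range(10)] + [make_br_agent(rng)]
matrix = cross_play_matrix(agents, rng)

for row in matrix:
    print(" ".join(f"{score:6.1f}" for score in row))
```

In this sketch the diagonal (self-play) and the last column (the parity-reading Bob) are always perfect, while off-diagonal pairings of the toy self-play agents depend on Bob's arbitrary guesses for unseen bulbs. The sketch only checks the Bob side of the best-response agent; how the paper's BR agent behaves as Alice is not modelled here.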