Cooperative Open-ended Learning Framework for Zero-shot Coordination

Yang Li; Shao Zhang; Jichen Sun; Yali Du; Ying Wen; Xinbing Wang; Wei Pan

TL;DR

We address zero-shot coordination in two-player cooperative games by reframing tasks with Graphic-Form Games and Preference Graphic-Form Games, and by implementing the Cooperative Open-ended Learning (COLE) framework to identify and overcome cooperative incompatibility. A practical instantiation, COLE_SV, combines a Graphic Shapley Value-based solver with a trainer that optimizes a joint objective balancing individual performance and cooperation over a cooperative-incompatibility distribution. The authors prove convergence to a local best-preferred strategy with a Q-sublinear rate under in-degree centrality and validate performance in Overcooked, where COLE_SV outperforms state-of-the-art baselines against unseen partners. The work advances open-ended MARL with graph-theoretic objective design, enabling more robust zero-shot coordination and shedding light on how to mitigate cooperative incompatibility in cooperative AI.

Abstract

Zero-shot coordination, that is, coordinating effectively with a wide range of unseen partners, remains a significant challenge in cooperative artificial intelligence (AI). Previous algorithms have attempted to address this challenge by optimizing fixed objectives within a population to improve strategy or behaviour diversity. However, these approaches can result in a loss of learning and an inability to cooperate with certain strategies within the population, known as cooperative incompatibility. To address this issue, we propose the Cooperative Open-ended LEarning (COLE) framework, which constructs open-ended objectives in two-player cooperative games from the perspective of graph theory to assess and identify the cooperative ability of each strategy. We further specify the framework and propose a practical algorithm that leverages knowledge from game theory and graph theory. An analysis of the algorithm's learning process shows that it can efficiently overcome cooperative incompatibility. Experimental results in the Overcooked game environment demonstrate that our method outperforms current state-of-the-art methods when coordinating with partners of different skill levels. Our demo is available at https://sites.google.com/view/cole-2023.

Paper Structure (27 sections, 5 theorems, 13 equations, 8 figures, 1 table, 2 algorithms)

Introduction
Related Works
Preliminaries
Normal-form Game.
Empirical Game-theoretic Analysis (EGTA), Empirical Game and Empirical Gamescape.
Cooperative Theoretic Concepts.
Centrality in Graph Theory.
Cooperative Open-Ended Learning
Graphic-Form Games (GFGs)
Cooperative Open-Ended Learning Framework
Practical Algorithm
Solver: Graphic Shapley Value
Trainer: Approximating local best-preferred Strategy
Experiments
Environment and Experimental Setting
...and 12 more sections

Key Result

Theorem 4.4

Let $s_0\in {\mathcal{S}}$ be the initial strategy and $s_i=\operatorname{oracle}(s_{i-1})$ for $i \in \mathbb{N}$. Under the effective functioning of the approximated oracle, as characterized by Eq. (eq:oracle_approx), the sequence $\{s_i\}_{i\in \mathbb{N}}$ could converge to a ...

Figures (8)

Figure 1: The Game Graph, (sub-) preference graph and corresponding preference centrality matrix. The (sub-) preference graphs are for all four iterations in the training process, and the corresponding preference in-degree centrality matrix is based on them. As can be observed in the ${\mathcal{G}}^\prime_3$ and ${\mathcal{G}}^\prime_4$, the newly updated strategies fail to be preferred by others and have centrality values of 1, despite an increase in the mean of rewards with all others. In (b), we illustrate an ideal learning process in which a newly generated strategy can achieve higher outcomes with all previous strategies.
Figure 2: The payoff matrix of each strategy during training and the corresponding preference centrality matrix of the MEP algorithm in Overcooked. The darker the color in the payoff matrix, the higher the rewards. The darker the color in the preference centrality matrix, the lower the centrality value, and the more other strategies prefer it.
Figure 3: An overview of one generation in the COLE framework: The solver derives the cooperative incompatibility distribution $\phi$ using a cooperative incompatibility solver, which can be any algorithm that evaluates cooperative contribution. The trainer then approximates the relaxed best response by optimizing individual and cooperative compatible objectives. The oracle's training data is generated using partners selected based on the cooperative incompatibility distribution and the agent's strategy. Finally, the approximated strategy $s_{n+1}$ is added to the population, and the next generation begins.
Figure 4: Evaluation of the effectiveness of combining objectives. Mean episode rewards over 400-timestep trajectories for $\text{COLE}_\text{SV}$ variants with objective ratios 0:4, 1:3, 2:2, and 3:1, paired with the unseen middle-level partner $H_{proxy}$. The gray bars in the background show the rewards of self-play.
Figure 5: Performance with middle-level partners. Mean episode rewards over 400-timestep trajectories for $\text{COLE}_\text{SV}$ with objective ratios 0:4 and 1:3, paired with the unseen middle-level partner $H_{proxy}$. The results include the mean and standard error over five different random seeds. The gray bars show the rewards obtained in self-play, and the hashed bars show the performance when starting positions are switched.
...and 3 more figures
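
The generation loop described in the Figure 3 caption (solver, partner sampling, trainer, population update) can be sketched roughly as follows. This is only an illustrative skeleton, not the authors' implementation: strategies are stand-in floats, and `cole_generation`, `toy_solver`, and `toy_trainer` are hypothetical names. In particular, `toy_solver` is a simple placeholder and does not compute the Graphic Shapley Value used by COLE_SV, and `toy_trainer` replaces reinforcement learning with a simple averaging step.

```python
import random
from typing import Callable, List, Sequence

Strategy = float  # assumption: a scalar stands in for a learned policy


def cole_generation(
    population: List[Strategy],
    solver: Callable[[Sequence[Strategy]], List[float]],
    trainer: Callable[[Strategy, Sequence[Strategy]], Strategy],
    rng: random.Random,
    num_partners: int = 4,
) -> List[Strategy]:
    """Run one generation: solve for phi, sample partners, train, extend the population."""
    # Solver: a distribution over the population (higher weight = harder partner).
    phi = solver(population)
    # Partners are drawn according to phi; the current strategy is also included,
    # mirroring "partners selected based on the distribution and the agent's strategy".
    current = population[-1]
    partners = rng.choices(population, weights=phi, k=num_partners) + [current]
    # Trainer: approximate a better response against the sampled partners.
    new_strategy = trainer(current, partners)
    return population + [new_strategy]


def toy_solver(population: Sequence[Strategy]) -> List[float]:
    """Placeholder solver: weight each strategy by its distance from the newest one."""
    newest = population[-1]
    gaps = [abs(newest - s) + 1e-6 for s in population]  # small offset keeps weights positive
    total = sum(gaps)
    return [g / total for g in gaps]


def toy_trainer(current: Strategy, partners: Sequence[Strategy]) -> Strategy:
    """Placeholder trainer: move halfway toward the mean of the sampled partners."""
    target = sum(partners) / len(partners)
    return current + 0.5 * (target - current)


population = [0.0, 1.0, 4.0]
new_population = cole_generation(population, toy_solver, toy_trainer, random.Random(0))
print(len(population), "->", len(new_population))
```

Passing the solver and trainer as callables reflects the caption's remark that the solver "can be any algorithm that evaluates cooperative contribution"; a real instantiation would swap in a learned policy and an actual incompatibility solver.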

Theorems & Definitions (13)

Definition 4.1: Graphic-Form Game
Definition 4.2: Preference Graphic-Form Game
Definition 4.3: Preference Centrality
Theorem 4.4
proof
Corollary 4.4
proof
Theorem 2.1
proof
Lemma 2.1
...and 3 more
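
To make the preference graph and preference centrality (Definitions 4.2 and 4.3, Figures 1 and 2) more concrete, here is a minimal sketch. It rests on assumptions that this page does not state in full: each strategy is taken to point to the single partner with which it earns its highest payoff, and centrality is taken to be one minus the in-degree divided by the number of strategies. That normalization is chosen only to match the captions, where a strategy preferred by no one has centrality 1 and lower values mean more strategies prefer it. The function names are hypothetical.

```python
import numpy as np


def preferred_partners(payoff: np.ndarray) -> np.ndarray:
    """For each row strategy i, return the partner j with the highest payoff[i, j].

    Ties are broken by the lowest index (np.argmax behaviour), which is an
    assumption of this sketch.
    """
    return np.asarray(payoff, dtype=float).argmax(axis=1)


def preference_centrality(payoff: np.ndarray) -> np.ndarray:
    """Assumed form: 1 - (in-degree in the preference graph) / (number of strategies)."""
    payoff = np.asarray(payoff, dtype=float)
    n = payoff.shape[0]
    # Count how many strategies point at each node in the preference graph.
    in_degree = np.bincount(preferred_partners(payoff), minlength=n)
    return 1.0 - in_degree / n


# Toy 3x3 payoff matrix; entry [i, j] is the reward strategy i obtains with partner j.
toy_payoff = np.array([
    [1.0, 5.0, 2.0],
    [3.0, 4.0, 9.0],
    [2.0, 8.0, 1.0],
])
centrality = preference_centrality(toy_payoff)
print(centrality)
```

In this toy matrix, strategy 0 is nobody's preferred partner, so its centrality is 1, which is the situation the Figure 1 caption describes for newly added strategies that fail to be preferred by others.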
