Learning to Cooperate with Humans using Generative Agents

Yancheng Liang, Daphne Chen, Abhishek Gupta, Simon S. Du, Natasha Jaques

TL;DR

We propose a method for posterior sampling from the generative model that is biased towards human data, which efficiently improves performance using only a small amount of expensive human interaction data.

Abstract

Training agents that can coordinate zero-shot with humans is a key mission in multi-agent reinforcement learning (MARL). Current algorithms focus on training simulated human partner policies, which are then used to train a Cooperator agent. The simulated human is produced either through behavior cloning over a dataset of human cooperation behavior, or by using MARL to create a population of simulated agents. However, these approaches often struggle to produce a Cooperator that can coordinate well with real humans, since the simulated humans fail to cover the diverse strategies and styles employed by people in the real world. We show that learning a generative model of human partners can effectively address this issue. Our model learns a latent variable representation of the human that can be regarded as encoding the human's unique strategy, intention, experience, or style. This generative model can be flexibly trained from any agent interaction data, whether it comes from humans or neural policies. By sampling from the latent space, we can use the generative model to produce different partners to train Cooperator agents. We evaluate our method, Generative Agent Modeling for Multi-agent Adaptation (GAMMA), on Overcooked, a challenging cooperative cooking game that has become a standard benchmark for zero-shot coordination. We conduct an evaluation with real human teammates, and the results show that GAMMA consistently improves performance, whether the generative model is trained on simulated populations or human datasets. Further, we propose a method for posterior sampling from the generative model that is biased towards the human data, enabling us to efficiently improve performance with only a small amount of expensive human interaction data.
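The idea of producing different training partners by sampling from a latent space can be illustrated with a toy sketch. This is not the authors' implementation: the standard normal prior, the linear softmax "decoder", the random weights standing in for a trained model, and every name and dimension below are assumptions made only for illustration.

```python
import math
import random

# Illustrative action set; Overcooked agents choose among six discrete actions.
ACTIONS = ["up", "down", "left", "right", "stay", "interact"]
OBS_DIM, LATENT_DIM = 4, 2  # hypothetical sizes, chosen only for this sketch


def sample_latent(rng, dim=LATENT_DIM):
    """Draw z from a standard normal prior (an assumed choice of prior)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def make_partner(z, weights):
    """Return a policy obs -> action probabilities, conditioned on latent z."""
    def policy(obs):
        x = list(obs) + list(z)  # condition by concatenating obs and z
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]  # numerically stable softmax
        total = sum(exps)
        return [e / total for e in exps]
    return policy


rng = random.Random(0)
# Stand-in for a trained decoder: one random weight row per action.
weights = [[rng.uniform(-1, 1) for _ in range(OBS_DIM + LATENT_DIM)]
           for _ in ACTIONS]

# Each sampled z yields a different simulated partner for Cooperator training.
partners = [make_partner(sample_latent(rng), weights) for _ in range(3)]
obs = [0.5, -0.2, 0.1, 0.9]  # hypothetical observation features
action_probs = [p(obs) for p in partners]
```

In this sketch the decoder weights are shared and only `z` changes, so the three partners react differently to the same observation. A Cooperator trained against many such samples would see a wider range of partner behavior than one trained against a single fixed policy.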

Paper Structure

This paper contains 32 sections, 3 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: We show the latent space covered by different methods. For either simulated data or human data, the generative agents produced by GAMMA can cover a larger strategy space. Generative models can provide novel agents by interpolating the agents in the simulated population (a). On human data (b), the human proxy model only covers a subset of all human player behavior patterns, while the generative model can capture the diversity in the data. We can also control the latent space sampling (c) to model a target population of agents (e.g., human coordinators).
  • Figure 2: Overview of the method for GAMMA. The generative model learns a latent distribution over partner strategies from either simulated or human data. Sampling partners from the generative model enables training a robust Cooperator that can coordinate with a variety of different humans.
  • Figure 3: The first five layouts (Cramped Room, Asymmetric Advantages, Coordination Ring, Forced Coordination, and Counter Circuit) were originally proposed in carroll2019utility. We create an additional Multi-strategy Counter layout. In this new layout, humans can additionally choose between making onion and tomato soup, which makes coordination significantly more challenging.
  • Figure 4: Evaluation of different methods using a human proxy model. Rewards are normalized by the highest reward achieved on each layout. The learning curves in (a) show the average normalized reward across all environments, indicating that GAMMA helps the Cooperator converge to a higher reward. This improvement is also consistent across individual layouts, as illustrated in (b) and (c). We observe the largest performance gap on the 'Counter Circuit' and 'Multi-Strategy Counter' layouts, which are the most complex in terms of the number of valid cooperation strategies.
  • Figure 5: Performance of different agents when paired with real humans. Error bars (cumming2007error) show the Standard Error of the Mean (SE), used to assess statistical significance ($p < 0.05$). Methods trained on human data are shown in green. Whether training with simulated or human data, GAMMA shows consistent, statistically significant advantages over the baselines. GAMMA-HA is able to efficiently use the real human dataset to learn a better sampling of its latent space, achieving the best performance when cooperating with real humans.
  • ...and 10 more figures
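
The captions for Figures 4 and 5 mention two simple statistics: rewards normalized by the highest reward achieved on each layout, and error bars based on the standard error of the mean. A minimal sketch of both follows. It assumes the per-layout maximum is taken across the compared methods, which the caption does not state explicitly. The method name `Baseline`, the function names, and all reward values are hypothetical placeholders, not results from the paper.

```python
import math
import statistics


def normalize_by_layout_max(rewards):
    """Divide each method's reward by the best reward on the same layout.

    rewards: {layout: {method: reward}}; returns a dict of the same shape.
    """
    normalized = {}
    for layout, by_method in rewards.items():
        best = max(by_method.values())
        normalized[layout] = {
            method: (reward / best if best > 0 else 0.0)
            for method, reward in by_method.items()
        }
    return normalized


def standard_error(samples):
    """Standard error of the mean: sample std. dev. divided by sqrt(n)."""
    return statistics.stdev(samples) / math.sqrt(len(samples))


# Hypothetical rewards, for illustration only.
rewards = {
    "Cramped Room": {"Baseline": 120.0, "GAMMA": 150.0},
    "Counter Circuit": {"Baseline": 40.0, "GAMMA": 80.0},
}
normalized = normalize_by_layout_max(rewards)

# Hypothetical per-episode scores for a single agent.
episode_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
se = standard_error(episode_scores)
```

Normalizing per layout puts every layout on a 0 to 1 scale, so averaging across layouts is not dominated by those with larger absolute rewards.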