On the Utility of Learning about Humans for Human-AI Coordination

Micah Carroll, Rohin Shah, Mark K. Ho, Thomas L. Griffiths, Sanjit A. Seshia, Pieter Abbeel, Anca Dragan

TL;DR

The paper argues that agents trained via self-play learn to coordinate with copies of themselves rather than with humans, and so suffer from distributional shift when paired with people. It introduces a simplified Overcooked-like environment and trains human models through behavior cloning to study human-AI collaboration, comparing self-play, population-based training, and human-model-based methods. Agents trained with a human model learned from human data (PPO$_{BC}$) coordinate more effectively with humans than agents trained via self-play or population-based training. Planning shows the same pattern, but only when the planner has the correct human model; an inaccurate model can hurt performance. The work emphasizes incorporating human behavior into training and suggests several directions for designing more human-aware coordination systems.

Abstract

While we would like agents that can coordinate with humans, current algorithms such as self-play and population-based training create agents that can coordinate with themselves. Agents that assume their partner to be optimal or similar to them can converge to coordination protocols that fail to understand and be understood by humans. To demonstrate this, we introduce a simple environment that requires challenging coordination, based on the popular game Overcooked, and learn a simple model that mimics human play. We evaluate the performance of agents trained via self-play and population-based training. These agents perform very well when paired with themselves, but when paired with our human model, they are significantly worse than agents designed to play with the human model. An experiment with a planning algorithm yields the same conclusion, though only when the human-aware planner is given the exact human model that it is playing with. A user study with real humans shows this pattern as well, though less strongly. Qualitatively, we find that the gains come from having the agent adapt to the human's gameplay. Given this result, we suggest several approaches for designing agents that learn about humans in order to better coordinate with them. Code is available at https://github.com/HumanCompatibleAI/overcooked_ai.

Paper Structure

This paper contains 20 sections, 13 figures, 4 tables.

Figures (13)

  • Figure 1: The impact of incorrect expectations of optimality. Left: In a competitive game, the agent plans for the worst case. $\mathbf{AI}$ expects that if it goes left, $\mathbf{H}$ will go left. So, it goes right where it expects to get 3 reward (since $\mathbf{H}$ would go left). When $\mathbf{H}$ suboptimally goes right, $\mathbf{AI}$ gets 7 reward: more than it expected. Right: In a collaborative game, $\mathbf{AI}$ expects $\mathbf{H}$ to coordinate with it to choose the best option, and so it goes left to obtain the 8 reward. However, when $\mathbf{H}$ suboptimally goes left, $\mathbf{AI}$ only gets 1 reward: the worst possible outcome!
  • Figure 2: Our Overcooked environment. The goal is to place three onions in a pot (dark grey), take out the resulting soup on a plate (white) and deliver it (light grey), as many times as possible within the time limit. $\mathbf{H}$, the human, is close to a dish dispenser and a cooked soup, and $\mathbf{AI}$, the agent, is facing a pot that is not yet full. The optimal strategy is for $\mathbf{H}$ to put an onion in the partially full pot, and for $\mathbf{AI}$ to put the existing soup in a dish and deliver it. This is due to the layout structure, which gives $\mathbf{H}$ an advantage in placing onions in pots and $\mathbf{AI}$ an advantage in delivering soups. However, we can guess that $\mathbf{H}$ plans to pick up a plate to deliver the soup. If $\mathbf{AI}$ nonetheless expects $\mathbf{H}$ to be optimal, it will expect $\mathbf{H}$ to turn around to get the onion, and will continue moving towards its own dish dispenser, leading to a coordination failure.
  • Figure 3: Experiment layouts. From left to right: Cramped Room presents low-level coordination challenges: in this shared, confined space it is very easy for the agents to collide. Asymmetric Advantages tests whether players can choose high-level strategies that play to their strengths, as illustrated in Figure 2. In Coordination Ring, players must coordinate to travel between the bottom left and top right corners of the layout. Forced Coordination instead removes collision coordination problems, and forces players to develop a high-level joint strategy, since neither player can serve a dish by themselves. Counter Circuit involves a non-obvious coordination strategy, where onions are passed over the counter to the pot, rather than being carried around.
  • Figure 4: Rewards over trajectories of 400 timesteps for the different agents (agents trained with themselves -- SP or PBT -- in teal, agents trained with the human model -- PPO$_{BC}$ -- in orange, and imitation agents -- BC -- in gray), with standard error over 5 different seeds, paired with the proxy human H$_{Proxy}$. The white bars correspond to what the agents trained with themselves expect to achieve, i.e. their performance when paired with themselves (SP+SP and PBT+PBT). First, these agents perform much worse with the proxy human than with themselves. Second, the PPO agent that trains with human data performs much better, as hypothesized. Third, imitation tends to perform somewhere in between the two other agents. The red dotted lines show the "gold standard" performance achieved by a PPO agent with direct access to the proxy model itself -- the difference in performance between this agent and PPO$_{BC}$ stems from the inaccuracy of the BC human model with respect to the actual H$_{Proxy}$. The hashed bars show results with the starting positions of the agents switched. This makes the most difference for asymmetric layouts such as Asymmetric Advantages or Forced Coordination.
  • Figure 5: Comparison across planning methods. We see a similar trend: coupled planning (CP) performs well with itself (CP+CP) and worse with the proxy human (CP+H$_{Proxy}$). Having the correct model of the human (the dotted line) helps, but a bad model (P$_{BC}$+H$_{Proxy}$) can be much worse because agents get stuck (see the appendix on the planning experiments).
  • ...and 8 more figures
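
The reasoning in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper or its repository. It assumes both games share one payoff table; the caption states the rewards 3, 7, 8 and 1 but does not say how the 8 and the 1 are assigned in the competitive game, so that part is an assumption. The names `PAYOFFS`, `plan_competitive`, and `plan_collaborative` are hypothetical.

```python
from typing import Dict, Tuple

ACTIONS = ("left", "right")

# AI's reward indexed by (ai_action, h_action).
# The values 3 and 7 (AI goes right) and 8 and 1 (AI goes left) come from the
# Figure 1 caption; reusing the same table for both games is an assumption.
PAYOFFS: Dict[Tuple[str, str], float] = {
    ("left", "left"): 1,
    ("left", "right"): 8,
    ("right", "left"): 3,
    ("right", "right"): 7,
}


def plan_competitive(payoffs: Dict[Tuple[str, str], float]) -> Tuple[str, float]:
    """Maximin: assume H replies with the action that minimizes AI's reward."""
    best = max(ACTIONS, key=lambda a: min(payoffs[(a, h)] for h in ACTIONS))
    expected = min(payoffs[(best, h)] for h in ACTIONS)
    return best, expected


def plan_collaborative(payoffs: Dict[Tuple[str, str], float]) -> Tuple[str, float]:
    """Optimistic: assume H replies with the action that maximizes the reward."""
    best = max(ACTIONS, key=lambda a: max(payoffs[(a, h)] for h in ACTIONS))
    expected = max(payoffs[(best, h)] for h in ACTIONS)
    return best, expected


if __name__ == "__main__":
    # Competitive game: H suboptimally goes right, so AI does better than planned.
    ai_move, expected = plan_competitive(PAYOFFS)
    print(ai_move, expected, PAYOFFS[(ai_move, "right")])

    # Collaborative game: H suboptimally goes left, so AI does worse than planned.
    ai_move, expected = plan_collaborative(PAYOFFS)
    print(ai_move, expected, PAYOFFS[(ai_move, "left")])
```

Under these assumptions, the worst-case planner is pleasantly surprised by a suboptimal partner, while the optimistic planner is the one exposed to the worst outcome, which is the asymmetry the caption describes.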