Improving Human-AI Coordination through Online Adversarial Training and Generative Models

Paresh Chaudhary, Yancheng Liang, Daphne Chen, Simon S. Du, Natasha Jaques

TL;DR

The paper tackles the difficulty of generalizing cooperative AI to diverse human partners by introducing GOAT, a framework that couples a frozen generative model of cooperative partner policies with online regret-based adversarial training. This design constrains the adversary to realistic, cooperative behaviors while using regret as a curriculum signal to continually challenge the Cooperator. GOAT achieves state-of-the-art zero-shot coordination across CMG, CRG, and Overcooked, including substantial gains in real-human evaluations (e.g., up to 38% improvement in a complex layout). The work advances practical human-AI coordination by improving generalization and sample efficiency, with potential implications for robotics, autonomous systems, and collaborative AI applications.

Abstract

Being able to cooperate with diverse humans is an important component of many economically valuable AI tasks, from household robotics to autonomous driving. However, generalizing to novel humans requires training on data that captures the diversity of human behaviors. Adversarial training is a promising method that allows dynamic data generation and ensures that agents are robust. It creates a feedback loop where the agent's performance influences the generation of new adversarial data, which can be used immediately to train the agent. However, adversarial training is difficult to apply in a cooperative task; how can we train an adversarial cooperator? We propose a novel strategy that combines a pretrained generative model to simulate valid cooperative agent policies with adversarial training to maximize regret. We call our method GOAT: Generative Online Adversarial Training. In this framework, GOAT dynamically searches the latent space of the generative model for coordination strategies where the learning policy, the Cooperator agent, underperforms. GOAT enables better generalization by exposing the Cooperator to various challenging interaction scenarios. We maintain realistic coordination strategies by keeping the generative model frozen, thus avoiding adversarial exploitation. We evaluate GOAT with real human partners, and the results demonstrate state-of-the-art performance on the Overcooked benchmark, highlighting its effectiveness in generalizing to diverse human behaviors.
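The loop described in the abstract can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the 3-action matching game, the fixed linear decoder, and every function name below are hypothetical. Regret is taken as the partner's self-play return minus its cross-play return with the Cooperator, and the adversary is replaced by plain random search over latents, with no claim that this matches the search procedure used in the paper.

```python
import math
import random

# Hypothetical toy setting: a 3-action matching game with identity payoff
# (reward 1 when both players pick the same action). Not the paper's environments.
# Frozen "decoder": a fixed linear map from a 2-D latent to action logits.
# It stands in for the pretrained generative model; the weights are made up.
DECODER_WEIGHTS = [[2.0, 0.0], [-1.0, 1.7], [-1.0, -1.7]]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_partner(z):
    """Map a latent vector to a partner policy; the decoder is never updated."""
    logits = [w[0] * z[0] + w[1] * z[1] for w in DECODER_WEIGHTS]
    return softmax(logits)


def expected_return(policy_a, policy_b):
    """Expected payoff when reward is 1 only if both players match actions."""
    return sum(a * b for a, b in zip(policy_a, policy_b))


def regret(partner, cooperator):
    """Partner's self-play return minus its cross-play return with the Cooperator."""
    return expected_return(partner, partner) - expected_return(partner, cooperator)


def adversary_step(cooperator, rng, num_candidates=64):
    """Random search: return the sampled latent whose decoded partner maximizes regret."""
    candidates = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
                  for _ in range(num_candidates)]
    return max(candidates, key=lambda z: regret(decode_partner(z), cooperator))


def cooperator_step(theta, partner, lr=0.5):
    """One gradient-ascent step on cross-play return w.r.t. the Cooperator's logits."""
    q = softmax(theta)
    xp = expected_return(partner, q)
    # d(xp)/d(theta_j) = q_j * (p_j - xp) for a softmax policy.
    return [t + lr * q_j * (p_j - xp) for t, q_j, p_j in zip(theta, q, partner)]


def train(num_rounds=200, seed=0):
    """Alternate adversary search and Cooperator updates; only the Cooperator learns."""
    rng = random.Random(seed)
    theta = [3.0, 0.0, 0.0]  # Cooperator starts heavily biased toward action 0
    history = []
    for _ in range(num_rounds):
        cooperator = softmax(theta)
        z = adversary_step(cooperator, rng)
        partner = decode_partner(z)
        history.append(regret(partner, cooperator))
        theta = cooperator_step(theta, partner)
    return softmax(theta), history
```

Because the adversary can only choose latents, every partner it produces is still a valid output of the frozen decoder, which is the constraint the abstract relies on to avoid adversarial exploitation. The toy Cooperator is a fixed action distribution that cannot observe its partner, so its regret cannot reach zero here; the sketch only shows how the two updates interleave.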

Paper Structure

This paper contains 18 sections, 4 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Adversarial training framework for cooperative agents. (Left) A generative model encodes simulated agents into a latent space to learn diverse agent strategies, which are then used to generate different types of training partners. (Right) GOAT samples new partners to maximize the Cooperator agent's regret, defined as the performance gap between self-play (the partner playing with itself) and cross-play (the partner playing with the Cooperator). The key idea is that the adversarial objective is constrained by the frozen generative model, which prevents it from generating self-sabotaging partners. But by applying regret-based adversarial training to search over the policies that can be generated by the model, we can expose the Cooperator to a curriculum of challenging training partners, ensuring it is robust to interacting with diverse human partners at test time.
  • Figure 2: (a) Cooperative Matrix Game (CMG). (b) to (g) Policy probability distributions showing the coverage of different methods on the CMG payoff matrix. (h) Total expected reward for each method, assuming partners are sampled uniformly and each method receives a payoff for each partner proportional to its coverage of the reward block.
  • Figure 3: CRG: average reward obtained against the 11 Heuristic Agent teammates across 5 seeds of each method.
  • Figure 4: Overcooked. carroll2020 first introduced the most challenging layout, Counter Circuit (a). To increase coordination complexity, liang2024 introduced the Multi-Strategy Counter layout (d), where players can choose between tomato and onion ingredients to cook soup. This adds a coordination challenge: agents must adapt to soup-making strategies based on different ingredients. (b) and (e) show the evaluation of GOAT against a human proxy model trained with behavior cloning (BC). We compare to 4 baselines, which are trained using simulated population data, including the previous state-of-the-art method, GAMMA (liang2024). Error bars show the standard error over 5 random seeds. (c) and (f) show the performance of each method when tested against real humans in the Counter Circuit and Multi-Strategy Counter layouts, respectively.
  • Figure 5: (a) GOAT explores the latent space over training episodes, in contrast to the standard normal cluster of GAMMA (red). (b) Comparison of Regret (blue) with Minimax (red) in the latent space.
  • ...and 5 more figures
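
The regret described in the Figure 1 caption can be written compactly. The notation is assumed here for illustration and may differ from the paper's own equations: $\pi_z$ is the partner policy decoded from latent $z$ by the frozen generative model, $\pi_C$ is the Cooperator, and $J(\pi_a, \pi_b)$ is the expected return when $\pi_a$ and $\pi_b$ play together.

$$
\mathrm{Regret}(z, \pi_C) = J(\pi_z, \pi_z) - J(\pi_z, \pi_C)
$$

Under this reading, the adversary chooses $z$ to increase the regret while the Cooperator is trained to increase $J(\pi_z, \pi_C)$. The self-play term does not depend on $\pi_C$, so for a fixed partner the Cooperator's training reduces the same regret.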