ROTATE: Regret-driven Open-ended Training for Ad Hoc Teamwork

Caroline Wang, Arrasy Rahman, Jiaxun Cui, Yoonchang Sung, Peter Stone

TL;DR

This work reframes Ad Hoc Teamwork (AHT) as an open-ended learning problem and introduces ROTATE, a regret-driven algorithm that alternates between improving the ego agent and generating diverse teammates that probe its weaknesses. By optimizing a per-state cooperative regret and maintaining a population of past teammates, ROTATE mitigates self-sabotage and improves generalization to unseen partners across two-player matrix games and popular coordination tasks. Empirical results show that ROTATE outperforms a range of baselines, with the per-state regret and the population buffer central to its success. The approach offers a practical pathway to robust, zero-shot coordination in cooperative multi-agent systems; the stated limitations are scaling beyond two agents and extending the theoretical analysis of regret objectives.

Abstract

Learning to collaborate with previously unseen partners is a fundamental generalization challenge in multi-agent learning, known as Ad Hoc Teamwork (AHT). Existing AHT approaches often adopt a two-stage pipeline, where first, a fixed population of teammates is generated with the idea that they should be representative of the teammates that will be seen at deployment time, and second, an AHT agent is trained to collaborate well with agents in the population. To date, the research community has focused on designing separate algorithms for each stage. This separation has led to algorithms that generate teammates with limited coverage of possible behaviors, and that ignore whether the generated teammates are easy to learn from for the AHT agent. Furthermore, algorithms for training AHT agents typically treat the set of training teammates as static, thus attempting to generalize to previously unseen partner agents without assuming any control over the set of training teammates. This paper presents a unified framework for AHT by reformulating the problem as an open-ended learning process between an AHT agent and an adversarial teammate generator. We introduce ROTATE, a regret-driven, open-ended training algorithm that alternates between improving the AHT agent and generating teammates that probe its deficiencies. Experiments across diverse two-player environments demonstrate that ROTATE significantly outperforms baselines at generalizing to an unseen set of evaluation teammates, thus establishing a new standard for robust and generalizable teamwork.
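
The alternation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the payoff matrix, the function names, the restriction to pure-strategy teammates, and the definition of cooperative regret used here (the best-response return against a teammate minus the ego agent's return with that teammate) are all assumptions made for illustration, in a one-shot two-player matrix game.

```python
import numpy as np

# Hypothetical cooperative payoff matrix: PAYOFF[ego_action, teammate_action].
# Both players receive the same reward, and matching actions pays 1.
PAYOFF = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])


def ego_return(ego_policy, teammate_policy):
    """Expected shared return when both players sample from their policies."""
    return float(ego_policy @ PAYOFF @ teammate_policy)


def cooperative_regret(ego_policy, teammate_policy):
    """Assumed regret: best-response return minus the ego agent's return."""
    best_response_return = float(np.max(PAYOFF @ teammate_policy))
    return best_response_return - ego_return(ego_policy, teammate_policy)


def generate_teammate(ego_policy):
    """Pick the pure-strategy teammate with the highest regret for the ego."""
    candidates = np.eye(PAYOFF.shape[1])
    regrets = [cooperative_regret(ego_policy, c) for c in candidates]
    return candidates[int(np.argmax(regrets))]


def improve_ego(buffer):
    """Best-respond to the average of all teammates stored in the buffer."""
    mean_teammate = np.mean(buffer, axis=0)
    best_action = int(np.argmax(PAYOFF @ mean_teammate))
    return np.eye(PAYOFF.shape[0])[best_action]


def open_ended_loop(num_iterations=3):
    """Alternate teammate generation and ego improvement."""
    num_actions = PAYOFF.shape[0]
    ego = np.full(num_actions, 1.0 / num_actions)  # start from a uniform policy
    buffer = []  # population of past teammates
    for _ in range(num_iterations):
        buffer.append(generate_teammate(ego))
        ego = improve_ego(buffer)
    return ego, buffer


if __name__ == "__main__":
    final_ego, teammates = open_ended_loop()
    print("final ego policy:", final_ego)
    print("teammates generated:", len(teammates))
```

In this toy setting the ego agent cannot observe or adapt to its partner within an episode, so the sketch only shows the outer alternation and the role of the teammate buffer, not the per-state regret that the paper optimizes.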

Paper Structure

This paper contains 54 sections, 22 equations, 10 figures, 16 tables, 3 algorithms.

Figures (10)

  • Figure 1: ROTATE Overview. ROTATE is an open-ended learning framework for AHT. The core idea of ROTATE is to improve the AHT agent by iteratively generating diverse teammates with whom the AHT agent struggles to collaborate, yet who are not so adversarial that effective teamwork becomes impossible.
  • Figure 2: Teammate policy optimization objectives: per-trajectory regret vs per-state regret.
  • Figure 3: (Left) ROTATE outperforms all baseline methods across all tasks in evaluation return. (Right) ROTATE with per-state regret (ours) outperforms ROTATE with per-trajectory regret in 5 of 6 tasks. 95% bootstrapped confidence intervals are shown, computed across all evaluation teammates and trials.
  • Figure 4: Probability of the sabotage action at all states in the sabotage game for ROTATE teammates trained with per-state regret (ours) vs per-trajectory regret. Results are aggregated across three trials.
  • Figure 5: CoMeDi-style mixed-play objective for teammate generation, in the context of open-ended AHT.
  • ...and 5 more figures