A Minimax Approach to Ad Hoc Teamwork

Victor Villin, Thomas Kleine Buening, Christos Dimitrakakis

TL;DR

This work tackles Ad Hoc Teamwork under partner uncertainty by reframing AHT as a Minimax-Bayes Reinforcement Learning problem over a finite background population of partner policies. By optimizing a focal policy against the worst-case prior over training scenarios, the approach yields strong worst-case guarantees and improved out-of-distribution robustness, demonstrated on tasks like Collaborative Cooking and Iterated Prisoner's Dilemma. The authors compare utility- and regret-based objective formulations, introduce a Gradient Descent-Ascent training algorithm for softmax policies, and show that minimax-distribution training can accelerate learning while improving robustness to unseen teammates. The findings highlight the critical role of the training-partner distribution in achieving robust AHT, with practical implications for curriculum-like scenario generation and robust coordination in multi-agent systems. The work advances robust AHT by providing theoretical guarantees, an actionable training methodology, and empirical evidence of improved performance across simple and deep RL coordination tasks.

Abstract

We propose a minimax-Bayes approach to Ad Hoc Teamwork (AHT) that optimizes policies against an adversarial prior over partners, explicitly accounting for uncertainty about partners at time of deployment. Unlike existing methods that assume a specific distribution over partners, our approach improves worst-case performance guarantees. Extensive experiments, including evaluations on coordinated cooking tasks from the Melting Pot suite, show our method's superior robustness compared to self-play, fictitious play, and best response learning. Our work highlights the importance of selecting an appropriate training distribution over teammates to achieve robustness in AHT.
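
As a rough illustration of the minimax training idea summarized above, the sketch below runs simultaneous gradient descent-ascent on a toy problem: a softmax focal policy `pi` ascends the expected utility while a softmax prior `beta` over scenarios descends it. It is not the authors' algorithm or code. The single-state setting, the utility table `U`, the function name `gda_minimax`, and the step size and iteration count are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def gda_minimax(U, steps=5000, lr=1.0):
    """Simultaneous gradient descent-ascent on f(theta, phi) = pi^T U beta.

    U[a, s] is a hypothetical utility of focal action `a` in scenario `s`.
    pi = softmax(theta) is the focal policy (maximizer);
    beta = softmax(phi) is the prior over scenarios (minimizer).
    """
    n_actions, n_scenarios = U.shape
    theta = np.zeros(n_actions)    # focal policy logits, uniform start
    phi = np.zeros(n_scenarios)    # scenario prior logits, uniform start
    for _ in range(steps):
        pi = softmax(theta)
        beta = softmax(phi)
        u_actions = U @ beta       # expected utility of each action under beta
        u_scenarios = U.T @ pi     # expected utility of pi in each scenario
        value = pi @ u_actions     # current expected utility pi^T U beta
        # Exact gradients of pi^T U beta with respect to the softmax logits.
        grad_theta = pi * (u_actions - value)
        grad_phi = beta * (u_scenarios - value)
        theta += lr * grad_theta   # ascent step for the focal policy
        phi -= lr * grad_phi       # descent step for the adversarial prior
    return softmax(theta), softmax(phi)


# Toy utility table (an assumption): 3 focal actions, 2 partner scenarios.
U = np.array([[1.0, 0.2],
              [0.6, 0.5],
              [0.3, 0.4]])

pi, beta = gda_minimax(U)
worst_case = float(np.min(U.T @ pi))
uniform_worst_case = float(np.min(U.T @ np.full(3, 1 / 3)))
print("focal policy:", np.round(pi, 3))
print("scenario prior:", np.round(beta, 3))
print("worst-case utility:", round(worst_case, 3),
      "vs uniform policy:", round(uniform_worst_case, 3))
```

The toy table was chosen to have a pure saddle point, so the plain simultaneous updates settle there. In games whose equilibrium is mixed, simultaneous gradient descent-ascent can cycle rather than converge, and a practical implementation would typically need smaller steps, iterate averaging, or a similar stabilizer.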

Paper Structure

This paper contains 15 sections, 5 theorems, 16 equations, 1 figure.

Key Result

corollary 1

For an $m$-player POMG $\mu$ in a finite state-action space, with a known reward function and a finite horizon, and a background population $\mathcal{B}$, the maximin game (eq:mbmarl.maximin) has a value:

Figures (1)

  • Figure 1: Illustration of the framework used in this paper. Prior to training the focal policy $\pi$, background policies with different preferences ($\lambda_i, \delta_i$) learn by interacting within sub-populations of varying sizes. These sub-populations are then combined to form a background population, $\mathcal{B}^\text{train}$, used as a common ‘training dataset’ for all algorithms. Our primary focus is on the training phase, where the focal policy $\pi$ is trained while the distribution $\beta$ over scenarios is tuned according to the proposed minimax game. These scenarios mix copies of $\pi$ with policies from $\mathcal{B}^\text{train}$, where the self-play scenario $\sigma^\text{SP}$ has the policy interacting only with copies of itself.

Theorems & Definitions (7)

  • corollary 1: buening_minimax_bayes_reinforcement_2023
  • corollary 2: buening_minimax_bayes_reinforcement_2023
  • lemma 1
  • definition 1: Non-degenerative population
  • lemma 2
  • definition 2: $\epsilon$-net of a scenario set
  • lemma 3