Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents

Arrasy Rahman, Jiaxun Cui, Peter Stone

TL;DR

The paper tackles robust ad hoc teamwork by reframing the selection of training teammates around the minimum coverage set $\text{MCS}(E)$, the set of best-response policies to any partner policy in the environment $E$. It introduces L-BRDiv, a Lagrangian-based method that jointly estimates an environment's $\text{MCS}(E)$ and constructs the training teammate set $\Pi^{\text{train}}$. Policies are trained via MAPPO, with learnable Lagrange multipliers balancing the self-play and cross-play objectives. Empirical results across four two-player cooperative environments show that L-BRDiv yields more robust AHT agents than the state-of-the-art baselines BRDiv and LIPO, because it discovers more distinct MCS members and needs less hyperparameter tuning. The work provides a principled, environment-driven approach to adversarial diversity in multi-agent learning, with implications for deploying agents that must cooperate with unseen teammates in realistic settings.

Abstract

Robustly cooperating with unseen agents and human partners presents significant challenges due to the diverse cooperative conventions these partners may adopt. Existing Ad Hoc Teamwork (AHT) methods address this challenge by training an agent with a population of diverse teammate policies obtained through maximizing specific diversity metrics. However, prior heuristic-based diversity metrics do not always maximize the agent's robustness in all cooperative problems. In this work, we first propose that maximizing an AHT agent's robustness requires it to emulate policies in the minimum coverage set (MCS), the set of best-response policies to any partner policies in the environment. We then introduce the L-BRDiv algorithm that generates a set of teammate policies that, when used for AHT training, encourage agents to emulate policies from the MCS. L-BRDiv works by solving a constrained optimization problem to jointly train teammate policies for AHT training and approximating AHT agent policies that are members of the MCS. We empirically demonstrate that L-BRDiv produces more robust AHT agents than state-of-the-art methods in a broader range of two-player cooperative problems without the need for extensive hyperparameter tuning for its objectives. Our study shows that L-BRDiv outperforms the baseline methods by prioritizing discovering distinct members of the MCS instead of repeatedly finding redundant policies.
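
To make the MCS idea concrete, the following is a minimal sketch for a one-shot cooperative matrix game, not the paper's algorithm. It assumes a shared payoff matrix, a small hand-picked list of partner policies, and AHT policies restricted to single actions. The game, the function names `best_responses` and `minimum_coverage_set`, and the brute-force search are all illustrative assumptions.

```python
from itertools import combinations

import numpy as np


def best_responses(payoff, partner_policy, tol=1e-9):
    """Return the row actions that maximize expected shared payoff."""
    expected = payoff @ partner_policy  # expected return of each row action
    return {int(a) for a in np.flatnonzero(expected >= expected.max() - tol)}


def minimum_coverage_set(payoff, partner_policies):
    """Smallest set of row actions holding a best response to every partner.

    Brute force over subsets by increasing size; only viable for tiny games.
    """
    br_sets = [best_responses(payoff, p) for p in partner_policies]
    n_actions = payoff.shape[0]
    for size in range(0, n_actions + 1):
        for candidate in combinations(range(n_actions), size):
            chosen = set(candidate)
            if all(br & chosen for br in br_sets):
                return chosen
    return set(range(n_actions))  # not reached: the full action set always covers


# Hypothetical game: rows are the AHT agent's actions, columns the partner's.
# Actions 0-2 reward matching the partner; action 3 is a safe but weak option.
payoff = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.3, 0.3, 0.3],
])

# Assumed partner policies: three deterministic partners and one uniform mixer.
partners = [
    np.array([1.0, 0.0, 0.0]),
    np.array([0.0, 1.0, 0.0]),
    np.array([0.0, 0.0, 1.0]),
    np.full(3, 1.0 / 3.0),
]

mcs = minimum_coverage_set(payoff, partners)
print(sorted(mcs))
```

In this toy game, action 3 is never needed to respond optimally to any listed partner, so it stays out of the coverage set. A teammate population that repeatedly elicits the same best response adds no new member, which is the redundancy the abstract refers to.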

Paper Structure (29 sections, 14 equations, 23 figures, 6 tables, 1 algorithm)


Figures (23)

  • Figure 1: Best-response policies to each $\pi^{-i}\in\Pi$.
  • Figure 2: Generating $\Pi^{\text{train}}$ based on identified best-response policies.
  • Figure 3: AHT training against $\Pi^{\text{train}}$ and the expected results when dealing with previously unseen teammate policies.
  • Figure 5: Lagrangian Best Response Diversity (L-BRDiv). The L-BRDiv algorithm trains a collection of policy networks (purple and orange boxes) and Lagrange multipliers (green cells inside the black rectangle). Each purple box represents a policy from $\{\pi^{i}\}_{i=1}^{K}\subseteq\Pi$, while each orange box represents a policy from $\{\pi^{-i}\}_{i=1}^{K}\subseteq\Pi$. The estimated returns of all policy pairs, $(\pi^{j},\pi^{-k}) \in \{\pi^{i}\}_{i=1}^{K} \times \{\pi^{-i}\}_{i=1}^{K}$, and their associated Lagrange multipliers are combined by a weighted summation (black dotted lines connect weights and multiplied terms) to compute the optimized term in the Lagrangian dual form (right red box). The policy networks are then trained via MAPPO (Yu et al., 2022) to maximize this term, while the Lagrange multipliers are trained to minimize it via stochastic gradient descent.
  • Figure 6: Repeated Matrix Game.
  • ...and 18 more figures
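
The weighted summation described in the Figure 5 caption can be sketched as follows. This is a hedged illustration only: the exact constraint form is not given in this summary, so the sketch assumes each constraint asks the self-play return of pair $j$ to exceed its cross-play return with partner $k$ by a margin. The names `lagrangian_term`, `multiplier_step`, `margin`, and `lr`, and the return values, are hypothetical. Only the multiplier update is shown; in the paper the policies are trained with MAPPO to maximize the term.

```python
import numpy as np


def lagrangian_term(returns, multipliers, margin):
    """Weighted sum of self-play returns and multiplier-weighted constraint slack.

    returns[j, k] is the estimated return of the pair (pi^j, pi^{-k}).
    Assumed constraint: returns[j, j] - margin - returns[j, k] >= 0 for j != k.
    """
    k = returns.shape[0]
    self_play = np.diag(returns)
    slack = self_play[:, None] - margin - returns
    off_diag = ~np.eye(k, dtype=bool)
    term = self_play.sum() + (multipliers * slack)[off_diag].sum()
    return term, slack


def multiplier_step(multipliers, slack, lr):
    """One projected gradient-descent step on the multipliers (kept >= 0)."""
    k = multipliers.shape[0]
    updated = np.maximum(multipliers - lr * slack, 0.0)
    updated[np.eye(k, dtype=bool)] = 0.0  # no constraint on self-play pairs
    return updated


# Hypothetical estimated returns for K = 2 policy pairs.
returns = np.array([
    [1.0, 0.2],   # pi^0 does poorly with partner 1: constraint satisfied
    [0.9, 1.0],   # pi^1 does well with partner 0: constraint violated
])
multipliers = np.ones((2, 2))

term, slack = lagrangian_term(returns, multipliers, margin=0.5)
new_multipliers = multiplier_step(multipliers, slack, lr=0.5)
print(term)
print(new_multipliers)
```

Under these assumptions, the multiplier on the violated constraint grows and the multiplier on the satisfied one shrinks, so the weights shift toward the cross-play terms that still need attention instead of being tuned by hand.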