Matching Multiple Experts: On the Exploitability of Multi-Agent Imitation Learning

Antoine Bergerault; Volkan Cevher; Negar Mehr

TL;DR

This paper establishes impossibility and hardness results for learning low-exploitability policies in general $n$-player Markov Games, including a new hardness result on characterizing the Nash gap given a fixed measure matching error.

Abstract

Multi-agent imitation learning (MA-IL) aims to learn optimal policies from expert demonstrations of interactions in multi-agent interactive domains. Despite existing guarantees on the performance of the resulting learned policies, characterizations of how far the learned policies are from a Nash equilibrium are missing for offline MA-IL. In this paper, we demonstrate impossibility and hardness results of learning low-exploitable policies in general $n$-player Markov Games. We do so by providing examples where even exact measure matching fails, and demonstrating a new hardness result on characterizing the Nash gap given a fixed measure matching error. We then show how these challenges can be overcome using strategic dominance assumptions on the expert equilibrium. Specifically, for the case of dominant strategy expert equilibria, assuming Behavioral Cloning error $\epsilon_{\text{BC}}$, this provides a Nash imitation gap of $\mathcal{O}\left(n\epsilon_{\text{BC}}/(1-\gamma)^2\right)$ for a discount factor $\gamma$. We generalize this result with a new notion of best-response continuity, and argue that this is implicitly encouraged by standard regularization techniques.

Paper Structure (46 sections, 26 theorems, 52 equations, 6 figures, 1 algorithm)

Introduction
Previous work
Single-agent Imitation Learning.
Multi-agent Imitation Learning.
Theoretical Barriers for MA-IL.
Preliminaries
Markov Games
Measuring optimality in games
Offline Imitation learning
Impossibility results for exact measure matching
Sufficiency of state-action matching under full-state support
Insufficiency of state-only matching with full-state support
Insufficiency of state-action measure matching with unvisited states
On the Infeasibility of Tractable Lower Bounds for Exploitability
Tractable and Consistent Exploitability Upper Bounds from Best-Response Continuity
...and 31 more sections

Key Result

Theorem 1

Let $\pi, \pi' \in \Pi$ be such that $\rho_{\pi} = \rho_{\pi'}$. Then, $\mathcal{S}^+_{\pi} = \mathcal{S}^+_{\pi'}$ and $\pi(\cdot | s) = \pi'(\cdot | s)$ for every $s \in \mathcal{S}^+_{\pi}$.
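
The statement above can be checked numerically on a toy example. The sketch below is not taken from the paper. It assumes that $\rho_{\pi}$ denotes the normalized discounted state-action occupancy measure and that $\mathcal{S}^+_{\pi}$ is the set of states visited with positive probability, since this page does not define either symbol. It also treats the joint action of all players as a single action index. The function names `occupancy` and `recover_policy`, as well as the three-state dynamics, are hypothetical and chosen only for illustration.

```python
import numpy as np

def occupancy(P, pi, mu0, gamma):
    """Normalized discounted state-action occupancy rho[s, a] (assumed definition).

    P[s, a, t] is the probability of moving from state s to state t under
    joint action a, pi[s, a] is the joint policy, mu0[s] the initial distribution.
    """
    n_states = P.shape[0]
    # State-to-state kernel induced by pi: P_pi[s, t] = sum_a pi[s, a] * P[s, a, t]
    P_pi = np.einsum("sa,sat->st", pi, P)
    # State occupancy solves d = (1 - gamma) * mu0 + gamma * P_pi^T d
    d = (1 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, mu0)
    return d[:, None] * pi

def recover_policy(rho, tol=1e-12):
    """Return the visited-state mask and pi(a|s) = rho(s, a) / sum_a rho(s, a)."""
    d = rho.sum(axis=1)
    visited = d > tol
    pi_hat = np.full_like(rho, np.nan)  # undefined on unvisited states
    pi_hat[visited] = rho[visited] / d[visited, None]
    return visited, pi_hat

# Toy dynamics: states 0 and 1 communicate, state 2 is absorbing and unreachable.
P = np.zeros((3, 2, 3))
P[0, 0, 0] = 1.0  # state 0, action 0: stay in state 0
P[0, 1, 1] = 1.0  # state 0, action 1: move to state 1
P[1, :, 0] = 1.0  # state 1, any action: return to state 0
P[2, :, 2] = 1.0  # state 2, any action: stay in state 2
mu0 = np.array([1.0, 0.0, 0.0])
gamma = 0.9

# Two policies that agree on states 0 and 1 but differ on the unvisited state 2.
pi_a = np.array([[0.5, 0.5], [0.2, 0.8], [1.0, 0.0]])
pi_b = np.array([[0.5, 0.5], [0.2, 0.8], [0.0, 1.0]])

rho_a = occupancy(P, pi_a, mu0, gamma)
rho_b = occupancy(P, pi_b, mu0, gamma)
visited, pi_hat = recover_policy(rho_a)
```

Under these assumptions, the two policies induce the same occupancy measure, and normalizing $\rho$ over actions recovers the policy exactly on the visited states. The policies still differ on state 2, which is never visited and therefore carries no information in $\rho$.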

Figures (6)

Figure 1: Cooperative two-player game $G$: the transitions and rewards are described by the left and right sub-figures, respectively. In this game, there exists a Nash equilibrium $\pi^E$ with full-state support and a policy $\pi$ such that $\mu_{\pi^E} = \mu_{\pi}$ with a Nash gap linear in the effective horizon $1/(1-\gamma)$.
Figure 2: Transitions of a two-player Markov Game. The unique initial state is $s_1$. The rest of the chain ($\cdots$) and reward functions can be designed to induce linear Nash gap for state-action matching.
Figure 3: Deterministic transition dynamics of a two-player game $G$, with states $s_0$, $s_{\text{exp}}$ and sub-Markov chain $M_k$. Player 1 has action space $\mathcal{A}_1 = \{a^r_1, a^r_2\}$ and player 2 has action space $\mathcal{A}_2 = \{a^c_1, a^c_2\}$.
Figure 4: Transition dynamics for the two-player game. $s_0$ is the initial state. The top branch contains all odd states while the bottom branch together with $s_0$ comprises all even states.
Figure 5: Evolution of the tight Nash gap upper bound and the tight $\delta$ function with the behavior cloning error. This shows how the delta function and Nash gap tend to increase together in the simple Tag-Game environment.
...and 1 more figures

Theorems & Definitions (55)

Definition 1: Best-response mapping
Definition 2: Nash equilibrium
Definition 3: Value gap
Definition 4: Nash gap, see ramponi2023imitationmeanfieldgames
Theorem 1
Corollary 1
Lemma 1
proof
Theorem 2
Definition 5: Tight Nash gap lower bound
...and 45 more
