Matching Multiple Experts: On the Exploitability of Multi-Agent Imitation Learning

Antoine Bergerault, Volkan Cevher, Negar Mehr

TL;DR

This paper establishes impossibility and hardness results for learning low-exploitable policies in general $n$-player Markov Games, including a new hardness result on characterizing the Nash gap given a fixed measure matching error.

Abstract

Multi-agent imitation learning (MA-IL) aims to learn optimal policies from expert demonstrations of interactions in multi-agent interactive domains. Despite existing guarantees on the performance of the resulting learned policies, characterizations of how far the learned policies are from a Nash equilibrium are missing for offline MA-IL. In this paper, we demonstrate impossibility and hardness results for learning low-exploitable policies in general $n$-player Markov Games. We do so by providing examples where even exact measure matching fails, and by demonstrating a new hardness result on characterizing the Nash gap given a fixed measure matching error. We then show how these challenges can be overcome using strategic dominance assumptions on the expert equilibrium. Specifically, for the case of dominant strategy expert equilibria, assuming Behavioral Cloning error $\varepsilon_{\text{BC}}$, this yields a Nash imitation gap of $\mathcal{O}\left(n\varepsilon_{\text{BC}}/(1-\gamma)^2\right)$ for a discount factor $\gamma$. We generalize this result with a new notion of best-response continuity, and argue that this is implicitly encouraged by standard regularization techniques.
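To get a feel for how the dominant-strategy bound scales, the following worked arithmetic plugs in illustrative values. The numbers $n = 2$, $\gamma = 0.9$ and $\varepsilon_{\text{BC}} = 10^{-3}$ are assumptions chosen for the example, not values from the paper, and the constant hidden by $\mathcal{O}$ is ignored.

```latex
\[
\frac{n\,\varepsilon_{\text{BC}}}{(1-\gamma)^2}
  = \frac{2 \times 10^{-3}}{(1 - 0.9)^2}
  = \frac{2 \times 10^{-3}}{10^{-2}}
  = 0.2 .
\]
```

Under these assumed values, the quadratic dependence on the effective horizon $1/(1-\gamma)$ amplifies the cloning error by a factor of 100 per player.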

Paper Structure

This paper contains 46 sections, 26 theorems, 52 equations, 6 figures, and 1 algorithm.

Key Result

Theorem 1

Let $\pi, \pi' \in \Pi$ be such that $\rho_{\pi} = \rho_{\pi'}$. Then, $\mathcal{S}^+_{\pi} = \mathcal{S}^+_{\pi'}$ and $\pi(\cdot | s) = \pi'(\cdot | s)$ for every $s \in \mathcal{S}^+_{\pi}$.
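Theorem 1 can be checked numerically on a toy example. The sketch below assumes that $\rho_{\pi}$ is the normalized discounted state-action occupancy measure and that $\mathcal{S}^+_{\pi}$ is the set of states visited with positive probability; neither definition appears in this excerpt, so both are assumptions. Joint actions are flattened into a single action index, and the function names `occupancy` and `policy_on_support` are hypothetical.

```python
import numpy as np

def occupancy(P, pi, mu0, gamma):
    """Discounted state-action occupancy rho(s, a) (assumed definition).

    P:   (S, A, S) transition tensor, P[s, a, s'].
    pi:  (S, A) policy over (flattened joint) actions, rows sum to 1.
    mu0: (S,) initial state distribution.
    """
    S = P.shape[0]
    # State-to-state kernel induced by pi: P_pi[s, s'] = sum_a pi[s, a] P[s, a, s'].
    P_pi = np.einsum("sa,sat->st", pi, P)
    # State occupancy solves d = (1 - gamma) * mu0 + gamma * P_pi^T d.
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu0)
    return d[:, None] * pi

def policy_on_support(rho, tol=1e-12):
    """Recover the visited-state set and the policy on it from rho alone."""
    d = rho.sum(axis=1)
    support = d > tol
    pi = np.full_like(rho, np.nan)  # undefined off the support
    pi[support] = rho[support] / d[support][:, None]
    return support, pi

# Toy 3-state, 2-action chain (illustrative only): state 2 is unreachable.
P = np.zeros((3, 2, 3))
P[0, 0, 0] = 1.0   # in s0, action 0 stays in s0
P[0, 1, 1] = 1.0   # in s0, action 1 moves to s1
P[1, :, 1] = 1.0   # s1 is absorbing
P[2, :, 2] = 1.0   # s2 is absorbing but never reached from s0
mu0 = np.array([1.0, 0.0, 0.0])

pi_a = np.array([[0.5, 0.5], [0.3, 0.7], [1.0, 0.0]])
pi_b = np.array([[0.5, 0.5], [0.3, 0.7], [0.0, 1.0]])  # differs only at s2

rho_a = occupancy(P, pi_a, mu0, gamma=0.9)
rho_b = occupancy(P, pi_b, mu0, gamma=0.9)
support, pi_rec = policy_on_support(rho_a)
```

Here the two policies induce the same occupancy measure, agree on every visited state, and differ only on the unvisited state, which is the freedom that Theorem 1 leaves open.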

Figures (6)

  • Figure 1: Cooperative two-player game $G$: the transitions and rewards are described by the left and right sub-figures, respectively. In this game, there exists a Nash equilibrium $\pi^E$ with full-state support and a policy $\pi$ such that $\mu_{\pi^E} = \mu_{\pi}$ with a Nash gap linear in the effective horizon $1/(1-\gamma)$.
  • Figure 2: Transitions of a two-player Markov Game. The unique initial state is $s_1$. The rest of the chain ($\cdots$) and reward functions can be designed to induce linear Nash gap for state-action matching.
  • Figure 3: Deterministic transition dynamics of a two-player game $G$, with states $s_0$, $s_{\text{exp}}$ and sub-Markov chain $M_k$. Player 1 has action space $\mathcal{A}_1 = \{a^r_1, a^r_2\}$ and player 2 has action space $\mathcal{A}_2 = \{a^c_1, a^c_2\}$.
  • Figure 4: Transition dynamics for the two-player game. $s_0$ is the initial state. The top branch contains all odd states while the bottom branch together with $s_0$ comprises all even states.
  • Figure 5: Evolution of the tight Nash gap upper bound and the tight $\delta$ function with the behavior cloning error. This shows how the delta function and Nash gap tend to increase together in the simple Tag-Game environment.
  • ...and 1 more figure

Theorems & Definitions (55)

  • Definition 1: Best-response mapping
  • Definition 2: Nash equilibrium
  • Definition 3: Value gap
  • Definition 4: Nash gap, see ramponi2023imitationmeanfieldgames
  • Theorem 1
  • Corollary 1
  • Lemma 1
  • proof
  • Theorem 2
  • Definition 5: Tight Nash gap lower bound
  • ...and 45 more
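The definitions listed above are not reproduced in this excerpt. As a rough illustration of what a Nash gap measures, the sketch below uses one common convention: the largest gain any single player can obtain by unilaterally deviating to a best response. This convention is an assumption and may differ from the paper's Definition 4. The setting is reduced to a one-shot two-player matrix game, and the function name `nash_gap` and the Prisoner's Dilemma payoffs are hypothetical choices for the example.

```python
import numpy as np

def nash_gap(A, B, x, y):
    """Largest unilateral improvement in a two-player matrix game.

    A, B: (m, n) payoff matrices for the row and column player.
    x, y: mixed strategies of the row player (m,) and column player (n,).
    """
    v_row = x @ A @ y
    v_col = x @ B @ y
    # Best pure deviation of each player against the other's fixed strategy.
    gain_row = np.max(A @ y) - v_row
    gain_col = np.max(x @ B) - v_col
    return max(gain_row, gain_col)

# Prisoner's Dilemma: action 0 = cooperate, action 1 = defect.
A = np.array([[3.0, 0.0], [5.0, 1.0]])
B = A.T

cooperate = np.array([1.0, 0.0])
defect = np.array([0.0, 1.0])

gap_cc = nash_gap(A, B, cooperate, cooperate)  # each player gains by defecting
gap_dd = nash_gap(A, B, defect, defect)        # dominant-strategy equilibrium
```

Defection is a dominant strategy in this game, so mutual defection has zero gap, while mutual cooperation leaves each player a profitable deviation.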