Learning Equilibria from Data: Provably Efficient Multi-Agent Imitation Learning

Till Freihaut, Luca Viano, Volkan Cevher, Matthieu Geist, Giorgia Ramponi

TL;DR

This work tackles the problem of learning Nash equilibria from expert data in two-player zero-sum Markov games by establishing the first expert-sample complexity results for imitation-based equilibrium learning. It shows that non-interactive approaches, such as Behavioral Cloning, incur fundamental dependence on a single-policy deviation concentrability coefficient ${\mathcal{C}}_{\max}$, which can be unbounded and make equilibrium recovery impossible in some MGs. To overcome this, the authors introduce two interactive algorithms: MAIL-BRO, which uses a Best Response Oracle and achieves an $\varepsilon$-NE with $\mathcal{O}(\varepsilon^{-4})$ expert queries, and MURMAIL, which forgoes the BR oracle with a maximum-uncertainty strategy and attains an $\varepsilon$-NE with $\widetilde{\mathcal{O}}( |\mathcal{S}|^4 |\mathcal{A}_{\max}|^5 (1-\gamma)^{-12} \varepsilon^{-8} )$ queries. The paper also provides a lower-bound construction showing the necessity of concentrability in the non-interactive setting and presents numerical results that validate the theory and illustrate the trade-offs between BC and the interactive methods. Overall, the results highlight that interactive information is crucial to efficiently learn strategic equilibria in multi-agent imitation settings, and the proposed algorithms offer practical, polynomial-time guarantees.
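To make the notion of an $\varepsilon$-NE concrete, the following is a minimal sketch of the Nash gap in the simplest special case: a single-state zero-sum matrix game. It assumes one common definition of the gap (the maximizer's best-response value minus the minimizer's best-response value); the paper's definition for Markov games may differ in form. The function name `nash_gap` and the matching-pennies payoff matrix are illustrative choices, not taken from the paper.

```python
def nash_gap(payoff, mu, nu):
    """Exploitability of the strategy pair (mu, nu) in a zero-sum matrix game.

    payoff[a][b] is the row player's payoff; the column player receives
    its negative. This definition is an assumption made for illustration.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Expected payoff of each pure row action against the column strategy nu.
    row_values = [
        sum(payoff[a][b] * nu[b] for b in range(n_cols)) for a in range(n_rows)
    ]
    # Expected payoff (to the row player) of each pure column action against mu.
    col_values = [
        sum(mu[a] * payoff[a][b] for a in range(n_rows)) for b in range(n_cols)
    ]
    # What the maximizer could gain by deviating, plus what the minimizer
    # could gain by deviating; zero exactly at a Nash equilibrium.
    return max(row_values) - min(col_values)


# Matching pennies (hypothetical example game).
matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]

gap_uniform = nash_gap(matching_pennies, [0.5, 0.5], [0.5, 0.5])
gap_biased = nash_gap(matching_pennies, [0.9, 0.1], [0.5, 0.5])
```

Under this definition, a pair $(\mu, \nu)$ is an $\varepsilon$-NE when the gap is at most $\varepsilon$. The uniform pair has zero gap in matching pennies, while the biased row strategy can be exploited by the column player.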

Abstract

This paper provides the first expert sample complexity characterization for learning a Nash equilibrium from expert data in Markov Games. We show that a new quantity, the single policy deviation concentrability coefficient, is unavoidable in the non-interactive imitation learning setting, and we provide an upper bound for behavioral cloning (BC) featuring this coefficient. BC exhibits substantial regret in games with a high concentrability coefficient, which leads us to use expert queries and to introduce two novel algorithms: MAIL-BRO and MURMAIL. The former employs a best response oracle and learns an $\varepsilon$-Nash equilibrium with $\mathcal{O}(\varepsilon^{-4})$ expert and oracle queries. The latter bypasses the best response oracle entirely at the cost of a worse expert query complexity of order $\mathcal{O}(\varepsilon^{-8})$. Finally, we provide numerical evidence confirming our theoretical findings.
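As a rough illustration of the non-interactive baseline, the sketch below implements behavioral cloning for one player in the tabular case as a count-based maximum-likelihood estimate. This is an assumption about the estimator, not necessarily the exact one analyzed in the paper; the names `behavioral_cloning` and `policy_at`, the uniform fallback, and the toy dataset are all hypothetical.

```python
from collections import Counter, defaultdict


def behavioral_cloning(dataset, actions):
    """Estimate a tabular policy from (state, action) pairs by counting.

    Returns a dict mapping each visited state to a dict of action
    probabilities (empirical frequencies).
    """
    counts = defaultdict(Counter)
    for state, action in dataset:
        counts[state][action] += 1

    policy = {}
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[state] = {a: action_counts[a] / total for a in actions}
    return policy


def policy_at(policy, state, actions):
    """Look up the cloned policy, falling back to uniform on unseen states.

    The uniform fallback is an illustrative assumption: the expert data
    carries no information about states the expert never visits.
    """
    uniform = {a: 1.0 / len(actions) for a in actions}
    return policy.get(state, uniform)


# Hypothetical expert data for one player: (state, action) pairs.
actions = ["L", "R"]
expert_data = [("s0", "L"), ("s0", "L"), ("s0", "R"), ("s1", "R")]

cloned = behavioral_cloning(expert_data, actions)
unseen = policy_at(cloned, "s2", actions)
```

The fallback branch hints at why coverage matters: on states that the expert pair never reaches, but that an opponent's deviation could reach, the cloned policy is unconstrained by the data.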

Paper Structure

This paper contains 43 sections, 19 theorems, 126 equations, 4 figures, 1 table, 4 algorithms.

Key Result

Theorem 3.1

Let $(\mu^E, \nu^E)$ denote a Nash equilibrium policy pair in a two-player zero-sum Markov game, and let ${\mathcal{D}}$ contain trajectories from this expert policy pair. Let $(\widehat{\mu}, \widehat{\nu})$ be the policies obtained via Behavior Cloning from ${\mathcal{D}}$ of size $N$. Then, with where $C_{\max} = \max_{\mu,\nu} \max \left\{\max_{\nu^\star \in \mathrm{br}(\mu)} \left\| {\frac{d

Figures (4)

  • Figure 1: Two-player zero-sum game with linear regret under full knowledge of the transitions.
  • Figure 2: Empirical evaluation for environments with different ${\mathcal{C}}(\mu^{\operatorname{E}},\nu^{\operatorname{E}})$.
  • Figure 3: Cooperative Markov game with linear regret under unknown transitions.
  • Figure 4: Nash Gap for MURMAIL and BC

Theorems & Definitions (37)

  • Theorem 3.1
  • Remark 3.1
  • Theorem 3.2: Construction of MG
  • Definition 4.1: Best Response Oracle
  • Theorem 4.1
  • Theorem 4.2
  • Remark 4.1
  • Lemma C.1
  • proof
  • proof: Proof of \ref{thm:lowerbound}
  • ...and 27 more