Rate optimal learning of equilibria from data

Till Freihaut, Luca Viano, Emanuele Nevali, Volkan Cevher, Matthieu Geist, Giorgia Ramponi

TL;DR

This paper addresses learning equilibria in two-player zero-sum Markov games from demonstrations, distinguishing non-interactive and interactive MAIL. It establishes a tight negative result for non-interactive MAIL: even when ${\mathcal{C}}(\mu^{\operatorname{E}},\nu^{\operatorname{E}})$ is finite, any algorithm requires $N = \Omega\left( \frac{{\mathcal{C}}_{\max}}{\varepsilon^2} \right)$ samples, and it shows that Behavior Cloning (BC) is rate-optimal in this setting. For the interactive regime, the authors introduce MAIL-WARM, which combines a reward-free warm-up phase with Behavior Cloning and achieves a near-optimal $\mathcal{O}(\varepsilon^{-2})$ sample complexity that is independent of ${\mathcal{C}}_{\max}$ and matches the derived lower bound in its dependence on $\varepsilon$. The theoretical guarantees are complemented by numerical results on grid-world environments and on the lower-bound construction, illustrating the limits of BC when concentrability is unbounded and the practical advantage of MAIL-WARM over prior interactive MAIL methods.
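To make the Behavior Cloning baseline concrete, here is a minimal tabular sketch for a single player. It is an illustration under stated assumptions, not the paper's implementation: the names `behavior_cloning` and `policy_for`, the `(state, action)` data format, and the uniform fallback on unvisited states are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict


def behavior_cloning(demos, num_actions):
    """Estimate a tabular policy from (state, action) pairs.

    The estimate is the empirical action frequency in each visited state.
    `demos` is assumed to be an iterable of (state, action) tuples with
    integer actions in range(num_actions).
    """
    counts = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1

    policy = {}
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[state] = [action_counts[a] / total for a in range(num_actions)]
    return policy


def policy_for(policy, state, num_actions):
    """Return the action distribution in `state`.

    Assumption for this sketch: states never seen in the demonstrations
    fall back to a uniform distribution, since the data says nothing
    about them.
    """
    return policy.get(state, [1.0 / num_actions] * num_actions)


if __name__ == "__main__":
    demos = [(0, 1), (0, 1), (0, 0), (1, 2)]
    pi = behavior_cloning(demos, num_actions=3)
    print(pi[0])                  # empirical frequencies in state 0
    print(policy_for(pi, 5, 3))   # state 5 never appears in the data
```

The fallback branch is where the difficulty lies: states that the expert pair never visits, but that an opponent's deviation can reach, are not covered by the demonstrations. This is the situation the concentrability coefficient is meant to quantify.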

Abstract

We close open theoretical gaps in Multi-Agent Imitation Learning (MAIL) by characterizing the limits of non-interactive MAIL and presenting the first interactive algorithm with near-optimal sample complexity. In the non-interactive setting, we prove a statistical lower bound that identifies the all-policy deviation concentrability coefficient as the fundamental complexity measure, and we show that Behavior Cloning (BC) is rate-optimal. For the interactive setting, we introduce a framework that combines reward-free reinforcement learning with interactive MAIL and instantiate it with an algorithm, MAIL-WARM. It improves the best previously known sample complexity from $\mathcal{O}(\varepsilon^{-8})$ to $\mathcal{O}(\varepsilon^{-2})$, matching the dependence on $\varepsilon$ implied by our lower bound. Finally, we provide numerical results that support our theory and illustrate, in environments such as grid worlds, where Behavior Cloning fails to learn.
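As a rough sense of scale for the improvement in the $\varepsilon$-dependence, the following worked arithmetic compares the two rates at a target accuracy of $\varepsilon = 0.1$. It is only illustrative: constants, logarithmic factors, and any dependence on the number of states, actions, or the horizon are ignored, and the value $\varepsilon = 0.1$ is chosen purely as an example.

```latex
\varepsilon = 0.1:\qquad
\varepsilon^{-8} = 10^{8}, \qquad
\varepsilon^{-2} = 10^{2}, \qquad
\frac{\varepsilon^{-8}}{\varepsilon^{-2}} = \varepsilon^{-6} = 10^{6}.
```

Under these simplifications, halving $\varepsilon$ multiplies an $\varepsilon^{-8}$ budget by $2^{8} = 256$ but an $\varepsilon^{-2}$ budget only by $2^{2} = 4$.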

Paper Structure

This paper contains 30 sections, 14 theorems, 99 equations, 4 figures, 1 table, 4 algorithms.

Key Result

Theorem 3.1

Let $\hat{\mu},\hat{\nu}$ be the output of a non-interactive MAIL algorithm $\mathrm{Alg}$. Then, for any $\mathrm{Alg}$, there exists a Markov game such that satisfying $\mathbb{E}_{\mathrm{Alg}}\left[{\left\langle{d_0},{V^{\mu^{\star} , \widehat{\nu} } - V^{ \widehat{\mu}, \nu^{\star} }}\right\r

Figures (4)

  • Figure 1: Markov game instance used for the lower bound.
  • Figure 2: Exploitability of BC in the lower-bound Markov game and comparison of imitation learning algorithms in Gridworlds for a pure NE expert (Gridworld 1) and a mixed one (Gridworld 2).
  • Figure 3: Markov game instance used for the lower bound.
  • Figure 4: Zero-sum Gridworld environment and different Nash equilibrium paths.

Theorems & Definitions (24)

  • Theorem 3.1
  • Corollary 3.1
  • Theorem 3.2
  • Definition 6.1: Expert Induced MDP
  • Theorem 6.1
  • Lemma 6.1: Exploitability decomposition
  • Theorem B.1
  • Corollary B.1
  • proof
  • Lemma B.1
  • ...and 14 more