Rate optimal learning of equilibria from data

Till Freihaut; Luca Viano; Emanuele Nevali; Volkan Cevher; Matthieu Geist; Giorgia Ramponi

TL;DR

This paper studies learning equilibria in two-player zero-sum Markov games from demonstrations, a problem known as Multi-Agent Imitation Learning (MAIL), and distinguishes the non-interactive and interactive settings. For non-interactive MAIL it establishes a tight negative result: even when ${\mathcal{C}}(\mu^{\operatorname{E}},\nu^{\operatorname{E}})$ is finite, any algorithm requires $N = \Omega\left( \frac{{\mathcal{C}}_{\max}}{\varepsilon^2} \right)$ samples, and Behavior Cloning (BC) is shown to be rate-optimal in this setting. For the interactive setting, the authors introduce MAIL-WARM, which combines a reward-free warm-up phase with Behavior Cloning and achieves a near-optimal $\mathcal{O}(\varepsilon^{-2})$ sample complexity that is independent of ${\mathcal{C}}_{\max}$ and matches the $\varepsilon$-dependence of the derived lower bound. The theoretical guarantees are complemented by numerical experiments on grid-world environments and on the lower-bound constructions, which illustrate the limits of BC when concentrability is unbounded and the practical advantage of MAIL-WARM over prior interactive MAIL methods.

Abstract

We close open theoretical gaps in Multi-Agent Imitation Learning (MAIL) by characterizing the limits of non-interactive MAIL and presenting the first interactive algorithm with near-optimal sample complexity. In the non-interactive setting, we prove a statistical lower bound that identifies the all-policy deviation concentrability coefficient as the fundamental complexity measure, and we show that Behavior Cloning (BC) is rate-optimal. For the interactive setting, we introduce a framework that combines reward-free reinforcement learning with interactive MAIL and instantiate it with an algorithm, MAIL-WARM. It improves the best previously known sample complexity from $\mathcal{O}(\varepsilon^{-8})$ to $\mathcal{O}(\varepsilon^{-2})$, matching the dependence on $\varepsilon$ implied by our lower bound. Finally, we provide numerical results that support our theory and illustrate environments, such as grid worlds, in which Behavior Cloning fails to learn.
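For readers unfamiliar with Behavior Cloning, the following is a minimal tabular sketch and not the authors' implementation. It assumes demonstrations for a single player are given as (state, action) pairs with integer actions, and it estimates the policy by empirical action frequencies. The function name `behavior_cloning` and the uniform fallback for unseen states are illustrative assumptions that do not come from the paper.

```python
from collections import Counter, defaultdict


def behavior_cloning(demos, num_actions):
    """Tabular BC sketch: estimate pi(a | s) by empirical action frequencies.

    demos: iterable of (state, action) pairs for ONE player, with actions
           given as integers in range(num_actions).
    Assumption (not from the paper): states that never appear in the data
    fall back to a uniform policy.
    """
    counts = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1

    def policy(state):
        seen = counts.get(state)  # .get does not create an entry for new states
        if not seen:
            # No demonstrations cover this state, so the data say nothing here.
            return [1.0 / num_actions] * num_actions
        total = sum(seen.values())
        return [seen[a] / total for a in range(num_actions)]

    return policy


# Hypothetical demonstrations: state 0 is visited three times, state 1 once.
demos = [(0, 1), (0, 1), (0, 0), (1, 2)]
pi_hat = behavior_cloning(demos, num_actions=3)
```

In this sketch the estimate is only informative on states that the expert data actually visit. This is one informal way to read why a coverage quantity such as the concentrability coefficient enters the non-interactive analysis.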
