Rate optimal learning of equilibria from data
Till Freihaut, Luca Viano, Emanuele Nevali, Volkan Cevher, Matthieu Geist, Giorgia Ramponi
TL;DR
This paper addresses learning equilibria in two-player zero-sum Markov games from demonstrations, distinguishing the non-interactive and interactive settings of Multi-Agent Imitation Learning (MAIL). It establishes a tight negative result for non-interactive MAIL: even when ${\mathcal{C}}(\mu^{\operatorname{E}},\nu^{\operatorname{E}})$ is finite, any algorithm requires $N = \Omega\left( \frac{{\mathcal{C}}_{\max}}{\varepsilon^2} \right)$ samples, and Behavior Cloning (BC) is rate-optimal in this setting. For the interactive regime, the authors introduce MAIL-WARM, which combines a reward-free warm-up phase with BC and achieves a near-optimal $\mathcal{O}(\varepsilon^{-2})$ sample complexity that is independent of ${\mathcal{C}}_{\max}$ and matches the $\varepsilon$-dependence of the derived lower bound. The theoretical guarantees are complemented by numerical results on grid-world environments and on the lower-bound constructions, which illustrate the limits of BC when concentrability is unbounded and the practical advantage of MAIL-WARM over prior interactive MAIL methods.
Abstract
We close open theoretical gaps in Multi-Agent Imitation Learning (MAIL) by characterizing the limits of non-interactive MAIL and presenting the first interactive algorithm with near-optimal sample complexity. In the non-interactive setting, we prove a statistical lower bound that identifies the all-policy deviation concentrability coefficient as the fundamental complexity measure, and we show that Behavior Cloning (BC) is rate-optimal. For the interactive setting, we introduce a framework that combines reward-free reinforcement learning with interactive MAIL and instantiate it with an algorithm, MAIL-WARM. MAIL-WARM improves the best previously known sample complexity from $\mathcal{O}(\varepsilon^{-8})$ to $\mathcal{O}(\varepsilon^{-2})$, matching the dependence on $\varepsilon$ implied by our lower bound. Finally, we provide numerical results that support our theory and illustrate, in environments such as grid worlds, settings where BC fails to learn.
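
To give a rough sense of scale for the improvement in the $\varepsilon$-dependence, the following illustrative calculation compares the two rates at a target accuracy of $\varepsilon = 0.1$. The choice $\varepsilon = 0.1$ is an assumption made only for illustration, and the calculation ignores constants, logarithmic factors, and the dependence on all other problem parameters, which the $\mathcal{O}(\cdot)$ notation suppresses:

$$
\varepsilon^{-8}\Big|_{\varepsilon = 0.1} = 10^{8},
\qquad
\varepsilon^{-2}\Big|_{\varepsilon = 0.1} = 10^{2},
\qquad
\frac{\varepsilon^{-8}}{\varepsilon^{-2}} = \varepsilon^{-6} = 10^{6}.
$$

Under these simplifications, the two rates differ by a factor of $\varepsilon^{-6}$, and the gap widens as the target accuracy $\varepsilon$ decreases.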
