Multi-Agent Inverse Q-Learning from Demonstrations
Nathaniel Haynam, Adam Khoja, Dhruv Kumar, Vivek Myers, Erdem Bıyık
TL;DR
The paper tackles reward misspecification in multi-agent IRL by introducing Multi-Agent Marginal Q-Learning from Demonstrations (MAMQL), which learns per-agent rewards using marginalized critics $\bar{Q}_{\psi_i}$ and models policies as generalized Boltzmann distributions $\pi_i(a_i|s) \propto \exp(\lambda \bar{Q}_{\psi_i}(s,a_i))$. Building on an IQ-Learn–style objective, the method jointly optimizes critics with a concave regularizer and recovers agent-specific rewards through a constrained regression, enabling offline and online training. Experiments on Gems, Overcooked, and Highway-Env show that MAMQL delivers higher average returns, faster convergence (roughly 2–4×), and substantially improved reward recovery (3–300×) compared to MAIRL and IL baselines, while maintaining low behavioral error. The approach advances robust, sample-efficient learning of multi-agent objectives from demonstrations, with practical implications for complex domains like autonomous driving and collaborative tasks, though future work should address human demonstration biases and safety concerns for real-world deployment.
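The Boltzmann policy form above, $\pi_i(a_i|s) \propto \exp(\lambda \bar{Q}_{\psi_i}(s,a_i))$, can be illustrated for a single state with a small tabular sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name `boltzmann_policy`, the use of NumPy, and the toy critic values are all hypothetical, and a discrete action set is assumed.

```python
import numpy as np


def boltzmann_policy(q_values, lam=1.0):
    """Return pi(a | s) proportional to exp(lam * Q_bar(s, a)) for one state.

    q_values: marginalized critic values for each action of one agent
              (hypothetical toy numbers, not taken from the paper).
    lam:      inverse temperature (the lambda in the formula above).
    """
    logits = lam * np.asarray(q_values, dtype=float)
    # Subtract the max logit for numerical stability; this does not
    # change the normalized probabilities.
    logits = logits - logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()


# Toy usage: three actions with made-up critic values.
probs = boltzmann_policy([1.0, 2.0, 3.0], lam=2.0)
print(probs)
```

With `lam=0` the sketch returns a uniform distribution, and as `lam` grows the probability mass concentrates on the action with the largest critic value.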
Abstract
When reward functions are hand-designed, deep reinforcement learning algorithms often suffer from reward misspecification, causing them to learn policies that are suboptimal with respect to the intended task objectives. In the single-agent case, inverse reinforcement learning (IRL) techniques attempt to address this issue by inferring the reward function from expert demonstrations. However, in multi-agent problems, misalignment between the learned and true objectives is exacerbated by increased environment non-stationarity and by variance that scales with the number of agents. As a result, in multi-agent general-sum games, multi-agent IRL algorithms have difficulty balancing cooperative and competitive objectives. To address these issues, we propose Multi-Agent Marginal Q-Learning from Demonstrations (MAMQL), a novel sample-efficient framework for multi-agent IRL. For each agent, MAMQL learns a critic marginalized over the other agents' policies, allowing for a well-motivated use of Boltzmann policies in the multi-agent context. We identify a connection between optimal marginalized critics and single-agent soft-Q IRL, allowing us to apply a direct, simple optimization criterion from the single-agent domain. Across our experiments on three different simulated domains, MAMQL significantly outperforms previous multi-agent methods in average reward, sample efficiency, and reward recovery, often by more than 2-5x. We make our code available at https://sites.google.com/view/mamql.
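To make the idea of a critic "marginalized over the other agents' policies" concrete, the following sketch treats a two-agent, single-state tabular case and takes a plain expectation of a joint critic over the other agent's action distribution. This is an assumption made for illustration: the abstract does not give the exact definition of the marginalized critic, and the function name `marginalize_critic` and the toy numbers are hypothetical.

```python
import numpy as np


def marginalize_critic(q_joint, pi_other):
    """Average a joint critic over the other agent's policy (one state).

    q_joint:  array of shape (n_actions_i, n_actions_j) holding
              Q_i(s, a_i, a_j) for a fixed state s (toy values).
    pi_other: array of shape (n_actions_j,) holding pi_j(a_j | s).

    Returns an array of shape (n_actions_i,) with
    Q_bar_i(s, a_i) = sum_j pi_j(a_j | s) * Q_i(s, a_i, a_j).
    Assumption: a plain expectation; the paper may define this differently.
    """
    q_joint = np.asarray(q_joint, dtype=float)
    pi_other = np.asarray(pi_other, dtype=float)
    if not np.isclose(pi_other.sum(), 1.0):
        raise ValueError("pi_other must sum to 1")
    return q_joint @ pi_other


# Toy usage: agent i has two actions, agent j has two actions.
q_bar = marginalize_critic([[1.0, 3.0], [2.0, 0.0]], [0.5, 0.5])
print(q_bar)
```

The resulting per-agent values depend only on the agent's own action, which is what lets a single-agent style Boltzmann policy be defined for each agent.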
