Multi-Agent Inverse Q-Learning from Demonstrations

Nathaniel Haynam, Adam Khoja, Dhruv Kumar, Vivek Myers, Erdem Bıyık

TL;DR

The paper tackles reward misspecification in multi-agent IRL by introducing Multi-Agent Marginal Q-Learning from Demonstrations (MAMQL), which learns per-agent rewards using marginalized critics $\bar{Q}_{\psi_i}$ and models policies as generalized Boltzmann distributions $\pi_i(a_i|s) \propto \exp(\lambda \bar{Q}_{\psi_i}(s,a_i))$. Building on an IQ-Learn–style objective, the method jointly optimizes critics with a concave regularizer and recovers agent-specific rewards through a constrained regression, enabling offline and online training. Experiments on Gems, Overcooked, and Highway-Env show that MAMQL delivers higher average returns, faster convergence (roughly 2–4×), and substantially improved reward recovery (3–300×) compared to MAIRL and IL baselines, while maintaining low behavioral error. The approach advances robust, sample-efficient learning of multi-agent objectives from demonstrations, with practical implications for complex domains like autonomous driving and collaborative tasks, though future work should address human demonstration biases and safety concerns for real-world deployment.
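The two ingredients named above, a critic marginalized over the other agents' policies and a Boltzmann policy over it, can be illustrated with a minimal tabular sketch. It is not the authors' implementation: it assumes a single fixed state, two agents, and a hand-written joint value table, and the names `marginalize_q`, `boltzmann_policy`, `joint_q`, and `other_policy` are hypothetical. The paper's critics $\bar{Q}_{\psi_i}$ are learned function approximators, and its "generalized" Boltzmann form may differ from the plain softmax used here.

```python
import math

def marginalize_q(joint_q, other_policy):
    """Average a joint value table over the other agent's action distribution.

    joint_q[a_i][a_j] is agent i's value for joint action (a_i, a_j) in one
    fixed state (illustrative numbers, not from the paper);
    other_policy[a_j] is the other agent's probability of taking a_j.
    """
    return [sum(p * q for p, q in zip(other_policy, row)) for row in joint_q]

def boltzmann_policy(q_bar, lam=1.0):
    """Return pi_i(a_i | s) proportional to exp(lam * q_bar[a_i])."""
    logits = [lam * q for q in q_bar]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - shift) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Agent i has three actions, the other agent has two.
joint_q = [[1.0, 3.0], [2.0, 2.0], [0.0, 1.0]]
other_policy = [0.5, 0.5]

q_bar = marginalize_q(joint_q, other_policy)
pi_i = boltzmann_policy(q_bar, lam=2.0)
```

In this sketch, a larger `lam` concentrates probability on the actions with the highest marginalized value, while `lam` near zero approaches a uniform policy.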

Abstract

When reward functions are hand-designed, deep reinforcement learning algorithms often suffer from reward misspecification, causing them to learn suboptimal policies in terms of the intended task objectives. In the single-agent case, inverse reinforcement learning (IRL) techniques attempt to address this issue by inferring the reward function from expert demonstrations. However, in multi-agent problems, misalignment between the learned and true objectives is exacerbated due to increased environment non-stationarity and variance that scales with multiple agents. As such, in multi-agent general-sum games, multi-agent IRL algorithms have difficulty balancing cooperative and competitive objectives. To address these issues, we propose Multi-Agent Marginal Q-Learning from Demonstrations (MAMQL), a novel sample-efficient framework for multi-agent IRL. For each agent, MAMQL learns a critic marginalized over the other agents' policies, allowing for a well-motivated use of Boltzmann policies in the multi-agent context. We identify a connection between optimal marginalized critics and single-agent soft-Q IRL, allowing us to apply a direct, simple optimization criterion from the single-agent domain. Across our experiments on three different simulated domains, MAMQL significantly outperforms previous multi-agent methods in average reward, sample efficiency, and reward recovery, often by more than 2-5x. We make our code available at https://sites.google.com/view/mamql.

Paper Structure

This paper contains 16 sections, 10 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: On the left, the proposed multi-agent grid world environment, where the square agents work together to collect a purple circular gem. At the center, the Overcooked cramped environment, with one agent adding onions to the soup and the other agent ready with another. On the right, a four-agent intersection from the bottom agent's viewpoint. The weighted line marks the priority level of collision avoidance with another agent.
  • Figure 2: Reward of recovered policy for MAMQL and baselines, across varying dataset sizes.
  • Figure 3: Reward recovery (MSE between true and predicted rewards) across training trajectories of the MAIRL algorithms.
  • Figure 4: Behavioral error across trajectories while training the MAIRL algorithms.