O-MAPL: Offline Multi-agent Preference Learning

The Viet Bui, Tien Mai, Hong Thanh Nguyen

TL;DR

The paper addresses reward inference in multi-agent reinforcement learning by proposing O-MAPL, an offline, end-to-end preference-based learning framework that learns the global Q-function directly under the centralized training with decentralized execution (CTDE) paradigm, bypassing explicit reward recovery. A value-decomposition (mixing) network factors the global Q and V into local components. The method uses a single-layer mixing network, chosen because it preserves convexity, together with an extreme-V update to ensure log-sum-exp consistency. Local policies are extracted with a weighted behavior cloning scheme that preserves global-local consistency regardless of mixing depth, with theoretical guarantees and stable training in practice. On SMAC and MAMuJoCo, O-MAPL outperforms strong baselines on both rule-based and LLM-generated preference data, often converging faster and more robustly, which supports its practicality for complex cooperative MARL tasks.
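To make the decomposition idea concrete, here is a minimal NumPy sketch of a single-layer (linear) mixing step followed by behavior-cloning weights. It is an illustration under stated assumptions, not the authors' implementation: the function names `mix_single_layer` and `wbc_weights`, the use of one shared weight vector for Q and V, the exponential-advantage form of the weights, and the `beta` and `max_weight` parameters are all hypothetical choices made for this example.

```python
import numpy as np

def mix_single_layer(local_values, weights, bias=0.0):
    """Linear mixing: total = sum_i |w_i| * local_i + bias.

    Taking |w_i| keeps the mixing weights non-negative, so the total
    never decreases when a local value increases (an assumption here).
    """
    local_values = np.asarray(local_values, dtype=float)
    w = np.abs(np.asarray(weights, dtype=float))
    return local_values @ w + bias

def wbc_weights(q_tot, v_tot, beta=1.0, max_weight=100.0):
    """Hypothetical behavior-cloning weights: exp((Q - V) / beta), clipped."""
    advantage = np.asarray(q_tot, dtype=float) - np.asarray(v_tot, dtype=float)
    return np.minimum(np.exp(advantage / beta), max_weight)

rng = np.random.default_rng(0)
n_agents, batch = 3, 4
local_q = rng.normal(size=(batch, n_agents))  # stand-ins for local q_i values
local_v = rng.normal(size=(batch, n_agents))  # stand-ins for local v_i values
w = rng.normal(size=n_agents)                 # shared mixing weights (assumed)

q_tot = mix_single_layer(local_q, w)
v_tot = mix_single_layer(local_v, w)
bc_weights = wbc_weights(q_tot, v_tot)
print(q_tot.shape, bc_weights.shape)
```

In a weighted behavior cloning loss, each sample's log-likelihood term for a local policy would be scaled by the corresponding entry of `bc_weights`, so transitions with higher estimated advantage contribute more to the update.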

Abstract

Inferring reward functions from demonstrations is a key challenge in reinforcement learning (RL), particularly in multi-agent RL (MARL), where large joint state-action spaces and complex inter-agent interactions complicate the task. While prior single-agent studies have explored recovering reward functions and policies from human preferences, similar work in MARL is limited. Existing methods often involve separate stages of supervised reward learning and MARL algorithms, leading to unstable training. In this work, we introduce a novel end-to-end preference-based learning framework for cooperative MARL, leveraging the underlying connection between reward functions and soft Q-functions. Our approach uses a carefully designed multi-agent value decomposition strategy to improve training efficiency. Extensive experiments on SMAC and MAMuJoCo benchmarks show that our algorithm outperforms existing methods across various tasks.

Paper Structure

This paper contains 45 sections, 5 theorems, 57 equations, 12 figures, 16 tables, 1 algorithm.

Key Result

Proposition 4.1

The loss $\mathcal{L}(\mathbf{q}, \mathbf{v}, w)$ is concave in $\mathbf{q}$ and $w$ (the parameters of the mixing networks), while the extreme-V loss function $\mathcal{J}(\mathbf{v})$ is convex in $\mathbf{v}$.
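The convexity half of this statement can be motivated by a standard fact: composing a convex function with an affine map gives a convex function. The sketch below is not the paper's proof. It assumes a single-layer mixing of the form $V_{tot}(\mathbf{v}) = \sum_i w_i v_i + b$ with $w$ held fixed, and writes $\phi$ for any convex scalar function applied to the mixed value; the symbols $w_i$, $b$, $\phi$, and $\theta$ are introduced here for illustration only.

```latex
% Assumption: single-layer (affine) mixing with fixed weights w and bias b.
V_{tot}(\mathbf{v}) = \sum_{i} w_i v_i + b

% For any \theta \in [0, 1] and any \mathbf{v}, \mathbf{v}', affinity gives
V_{tot}\bigl(\theta \mathbf{v} + (1-\theta)\mathbf{v}'\bigr)
  = \theta\, V_{tot}(\mathbf{v}) + (1-\theta)\, V_{tot}(\mathbf{v}')

% so convexity of \phi carries over to the composition:
\phi\bigl(V_{tot}(\theta \mathbf{v} + (1-\theta)\mathbf{v}')\bigr)
  \le \theta\, \phi\bigl(V_{tot}(\mathbf{v})\bigr)
    + (1-\theta)\, \phi\bigl(V_{tot}(\mathbf{v}')\bigr)
```

The middle equality is the step that relies on the mixing being affine in $\mathbf{v}$. A mixing network with a nonlinear hidden layer does not satisfy it in general, which is consistent with the separate treatment of two-layer mixing networks in Proposition 4.2.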

Figures (12)

  • Figure 1: Evaluation curves (in win rates) of our O-MAPL for SMACv2 with rule-based preference data.
  • Figure 2: Evaluation curves (in returns) on MAMuJoCo tasks.
  • Figure 3: Plot of the function $f(t) = e^{1 - e^{t}} + e^{t} - 1$.
  • Figure 4: Evaluation curves (returns) for MAMuJoCo tasks with rule-based preference data.
  • Figure 5: Evaluation curves (returns) for SMACv1 tasks with rule-based preference data.
  • ...and 7 more figures
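The function plotted in Figure 3 is simple enough to examine numerically. The sketch below evaluates $f(t) = e^{1 - e^{t}} + e^{t} - 1$ on a grid and locates its minimum. Differentiating gives $f'(t) = e^{t}\,(1 - e^{1 - e^{t}})$, which is negative for $t < 0$, zero at $t = 0$, and positive for $t > 0$, so $f$ attains its minimum value $f(0) = 1$ at $t = 0$. The grid range and the variable names are choices made for this example, and the role this function plays in the paper is not stated in this summary.

```python
import numpy as np

def f(t):
    """f(t) = exp(1 - exp(t)) + exp(t) - 1, as given in the Figure 3 caption."""
    t = np.asarray(t, dtype=float)
    return np.exp(1.0 - np.exp(t)) + np.exp(t) - 1.0

# Evaluate on a symmetric grid around zero (the range is an arbitrary choice).
grid = np.linspace(-3.0, 3.0, 601)
values = f(grid)

# Location and value of the smallest grid entry.
t_min = grid[np.argmin(values)]
f_min = values.min()
print(t_min, f_min)
```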

Theorems & Definitions (10)

  • Proposition 4.1: Convexity
  • Proposition 4.2: Non-convexity under two-layer mixing networks
  • Theorem 4.3: Global-Local Consistency (GLC)
  • Theorem 4.4
  • Proposition 4.5
  • proof
  • proof
  • proof
  • proof
  • proof