Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

Zhuoran Li; Hai Zhong; Xun Wang; Qingxin Xia; Lihua Zhang; Longbo Huang

TL;DR

This work proposes OMAD, one of the first Online off-policy MARL frameworks to use Diffusion policies for orchestrating coordination. Its relaxed policy objective maximizes scaled joint entropy, enabling effective exploration without requiring a tractable likelihood.

Abstract

Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination, and enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well-positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation in image generation and offline settings. Yet their potential in online MARL remains largely under-explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy-based exploration and coordination. To tackle this challenge, we propose OMAD, one of the first Online off-policy MARL frameworks to use Diffusion policies for orchestrating coordination. Our key innovation is a relaxed policy objective that maximizes scaled joint entropy, facilitating effective exploration without relying on a tractable likelihood. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies. This value function leverages tractable entropy-augmented targets to guide the simultaneous updates of the diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state-of-the-art across $10$ diverse tasks, demonstrating a $2.5\times$ to $5\times$ improvement in sample efficiency.

Paper Structure (27 sections, 1 theorem, 22 equations, 8 figures, 3 tables, 1 algorithm)

Introduction
Related Works
Online Multi-Agent Reinforcement Learning
Diffusion Policies in Reinforcement Learning
Preliminary
MARL and Efficient Value Estimation
Diffusion policy
Bridge to OMAD and Theoretical Insight
The OMAD Method
Decentralized Diffusion Policy Formulation
Online Centralized Training the Diffusion Policy
Algorithm and Discussion
Experiments
Experiment Setup
Experiment Results
...and 12 more sections

Key Result

Theorem 1

(Entropy Lower Bound for Decentralized Diffusion Policies) Given that the joint policy is factorized into independent diffusion processes, the entropy of the joint distribution $\mathcal{H}(\bm{\pi_{\theta}}(a|s))$ is lower-bounded by the sum of the agents' individual variational bounds.
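
The structure of such a bound can be sketched as follows. This is an illustration under stated assumptions, not the theorem's exact inequality, which is not reproduced here. Assume $N$ agents whose per-agent policies $\pi^{i}_{\theta}(a^{i}|s)$ are conditionally independent given $s$, so that the joint entropy splits into a sum, and let $\ell^{i}(s)$ stand for whatever variational lower bound is available on agent $i$'s entropy. The symbols $N$, $\pi^{i}_{\theta}$, $a^{i}$, and $\ell^{i}$ are notation assumed for this sketch.

$$
\mathcal{H}\big(\bm{\pi_{\theta}}(a|s)\big)
  = \sum_{i=1}^{N} \mathcal{H}\big(\pi^{i}_{\theta}(a^{i}|s)\big)
  \;\ge\; \sum_{i=1}^{N} \ell^{i}(s),
\qquad \text{where } \ell^{i}(s) \le \mathcal{H}\big(\pi^{i}_{\theta}(a^{i}|s)\big) \text{ for each } i.
$$

The equality uses only the independence of the factors; the inequality applies the per-agent bound term by term. This is why a tractable bound for each agent yields a tractable surrogate for the joint entropy.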

Figures (8)

Figure 1: The CTDE framework of OMAD. The left panel illustrates Decentralized Execution, where agents independently sample actions via a denoising diffusion process. The right panel depicts Centralized Training, where a shared Distributional Critic provides unified Value Guidance to jointly optimize policies, stabilized by adaptive regularization for the entropy evidence lower bound.
Figure 2: Learning curves comparing OMAD with state-of-the-art online MARL baselines (HATD3 and HASAC) and two representative extensions of the diffusion policies (MADPMD and MASDAC) on MPE and MAMuJoCo benchmarks. The plots report the average episode return over training steps, averaged across 5 random seeds, with shaded regions indicating one standard deviation. Results demonstrate that OMAD consistently achieves faster convergence and superior final performance across both low-dimensional MPE tasks and high-dimensional continuous control environments.
Figure 3: State coverage comparison on representative dimensions ($1$ and $21$) at $250$k steps. We visualize the state occupancy within the replay buffers for HATD3, HASAC, and OMAD. Colored regions (red, green, blue and orange) indicate visited states. OMAD achieves the broadest coverage, where orange regions are uniquely explored by OMAD, demonstrating superior exploration.
Figure 4: Ablation study on Distributional Q-function hyperparameters. Left: The sensitivity of the agent's performance to the support upper bound $V_{\max}$. Right: The impact of the discretization resolution on performance for different numbers of atoms.
Figure 5: Ablation study on the number of denoising steps. Left: Episode return curves during training for varying denoising steps. Right: The trade-off between computational cost (training/inference time) and the number of steps.
...and 3 more figures
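
To make the two hyperparameters in the Figure 4 caption concrete, here is a minimal sketch of how a support upper bound and a number of atoms determine the discretization resolution of a categorical value distribution. It assumes a standard categorical (C51-style) parameterization with evenly spaced atoms; the source does not confirm that OMAD's critic takes exactly this form. The lower bound `v_min`, the function names, and the numeric values are hypothetical.

```python
import math


def make_support(v_min, v_max, num_atoms):
    """Return evenly spaced atom locations on [v_min, v_max] and their spacing."""
    if num_atoms < 2:
        raise ValueError("need at least two atoms")
    delta = (v_max - v_min) / (num_atoms - 1)  # discretization resolution
    atoms = [v_min + i * delta for i in range(num_atoms)]
    return atoms, delta


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_value(logits, atoms):
    """Collapse a categorical value distribution to its scalar mean."""
    probs = softmax(logits)
    return sum(p * z for p, z in zip(probs, atoms))


# Hypothetical settings: v_min and v_max are illustrative, not from the paper.
atoms, delta = make_support(v_min=0.0, v_max=100.0, num_atoms=51)
# Uniform logits put equal mass on every atom, so the mean is the midpoint.
q_value = expected_value([0.0] * 51, atoms)
```

Under this parameterization, raising the upper bound with a fixed atom count coarsens the resolution, while adding atoms refines it at extra compute cost, which is the trade-off the two panels of the ablation vary.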

Theorems & Definitions (2)

Theorem 1
proof
