OM2P: Offline Multi-Agent Mean-Flow Policy

Zhuoran Li, Xun Wang, Hai Zhong, Qingxin Xia, Lihua Zhang, Longbo Huang

TL;DR

This approach is the first to successfully integrate mean-flow models into offline MARL, paving the way for practical and scalable generative policies in cooperative multi-agent settings.

Abstract

Generative models, especially diffusion and flow-based models, have shown promise in offline multi-agent reinforcement learning (MARL). However, integrating powerful generative models into this framework poses unique challenges. In particular, diffusion and flow-based policies suffer from low sampling efficiency due to their iterative generation processes, making them impractical in time-sensitive or resource-constrained settings. To tackle these difficulties, we propose OM2P (Offline Multi-Agent Mean-Flow Policy), a novel offline MARL algorithm that achieves efficient one-step action sampling. To address the misalignment between generative objectives and reward maximization, we introduce a reward-aware optimization scheme that integrates a carefully designed mean-flow matching loss with Q-function supervision. Additionally, we design a generalized timestep distribution and a derivative-free estimation strategy to reduce memory overhead and improve training stability. Empirical evaluations on Multi-Agent Particle and MuJoCo benchmarks demonstrate that OM2P achieves superior performance, with up to a 3.8x reduction in GPU memory usage and up to a 10.8x speed-up in training time. Our approach is the first to successfully integrate mean-flow models into offline MARL, paving the way for practical and scalable generative policies in cooperative multi-agent settings.
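To make the contrast between iterative and one-step sampling concrete, here is a minimal toy sketch; it is not the OM2P implementation. It assumes the common mean-flow convention in which a linear path runs from a target sample at t = 0 to noise at t = 1, and in which an average velocity u(z_t, r, t) over [r, t] allows a sample to be produced as z_0 = z_1 - u(z_1, 0, 1). The scalar point-mass target, the closed-form velocity, and all function names are hypothetical stand-ins for a learned policy network.

```python
def instantaneous_velocity(z, t, target):
    # Toy marginal velocity for a point-mass target on the linear path
    # z_t = (1 - t) * target + t * noise  (assumed convention).
    return (z - target) / t

def average_velocity(z_t, r, t, target):
    # Along this toy path the instantaneous velocity is constant,
    # so its average over [r, t] has the same closed form.
    return (z_t - target) / t

def euler_sample(noise, target, steps):
    # Iterative baseline: integrate from t = 1 down to t = 0,
    # one velocity evaluation per step.
    z = noise
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        z = z - dt * instantaneous_velocity(z, t, target)
    return z, steps

def one_step_sample(noise, target):
    # One-step sampling: a single average-velocity evaluation over [0, 1].
    return noise - average_velocity(noise, 0.0, 1.0, target), 1

if __name__ == "__main__":
    print(euler_sample(0.7, 2.0, steps=10))
    print(one_step_sample(0.7, 2.0))
```

In this toy both samplers reach the same target, but the iterative one needs one network evaluation per step, which is the cost that one-step sampling is meant to remove.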

Paper Structure

This paper contains 37 sections, 7 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the decentralized OM2P framework. The figure shows a single representative agent performing scalable, one-step action generation, which avoids the computational bottleneck of iterative sampling in multi-agent settings.
  • Figure 2: Average episodic return at different training timesteps on the MPE World task with the expert dataset. Left: Behavior cloning performance using a Beta distribution ($\xi = [5,5,0,0]$) vs. a uniform distribution ($\xi = [0,0,0,0]$). Non-uniform weighting improves stability by emphasizing critical timesteps. Right: Performance under varying finite-difference step sizes $\Delta r$ for estimating $\frac{\text{d}u_{\theta}}{\text{d}r}$. Values $\Delta r \leq 10^{-8}$ yield results comparable to exact gradients, validating the reliability of our approximation. Details are given in the Appendix.
  • Figure 3: Performance impact of the coefficient $\eta$ in MPE World.
  • Figure 4: Effect of removing key components from OM2P on the MPE World task. Each ablation variant shows reduced average returns compared to the full model, confirming the importance of each module.
  • Figure 5: Multi-agent particle environments (MPE) and Multi-agent HalfCheetah task in MuJoCo Environment (MAMuJoCo).
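
The finite-difference estimate of $\frac{\text{d}u_{\theta}}{\text{d}r}$ mentioned in the Figure 2 caption can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's code: the function `u` below is a hypothetical smooth stand-in for the learned network $u_{\theta}$, and a forward difference is assumed because the caption does not say which difference scheme is used.

```python
import math

def u(z, r, t):
    # Hypothetical smooth stand-in for the learned average-velocity network.
    return math.sin(3.0 * r) * z + math.cos(t - r)

def du_dr_exact(z, r, t):
    # Analytic derivative of the toy function with respect to r.
    return 3.0 * math.cos(3.0 * r) * z + math.sin(t - r)

def du_dr_fd(z, r, t, delta_r):
    # Forward difference (assumed scheme): two forward evaluations,
    # no derivative computation through the function itself.
    return (u(z, r + delta_r, t) - u(z, r, t)) / delta_r

if __name__ == "__main__":
    z, r, t = 0.5, 0.2, 0.9
    exact = du_dr_exact(z, r, t)
    for delta_r in (1e-1, 1e-2, 1e-3):
        approx = du_dr_fd(z, r, t, delta_r)
        print(delta_r, abs(approx - exact))
```

The estimate needs only extra forward evaluations, which is the sense in which such a strategy is derivative-free; how the step size interacts with numerical precision in the real model is something the paper's appendix, not this toy, would have to settle.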