Multi-agent Coordination via Flow Matching

Dongsu Lee, Daehee Lee, Amy Zhang

TL;DR

This paper introduces MAC-Flow, an offline MARL framework that targets both multi-agent coordination and computational efficiency. It first learns a flow-based joint policy via flow matching on offline data, then distills it into decentralized one-step policies guided by the IGM principle and Q-learning, enabling fast, real-time execution. The approach achieves strong coordination across discrete and continuous benchmarks, with approximately $\times14.5$ faster inference than diffusion-based MARL methods at competitive performance, and it is compatible with online fine-tuning. Theoretical results relate distillation quality to the policy and value gaps, and the experiments show that the speedup comes with little loss in performance.

Abstract

This work presents MAC-Flow, a simple yet expressive framework for multi-agent coordination. We argue that requirements of effective coordination are twofold: (i) a rich representation of the diverse joint behaviors present in offline data and (ii) the ability to act efficiently in real time. However, prior approaches often sacrifice one for the other, i.e., denoising diffusion-based solutions capture complex coordination but are computationally slow, while Gaussian policy-based solutions are fast but brittle in handling multi-agent interaction. MAC-Flow addresses this trade-off by first learning a flow-based representation of joint behaviors, and then distilling it into decentralized one-step policies that preserve coordination while enabling fast execution. Across four different benchmarks, including $12$ environments and $34$ datasets, MAC-Flow alleviates the trade-off between performance and computational cost, specifically achieving about $\boldsymbol{\times14.5}$ faster inference compared to diffusion-based MARL methods, while maintaining good performance. At the same time, its inference speed is similar to that of prior Gaussian policy-based offline multi-agent reinforcement learning (MARL) methods.
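
The two-stage idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the toy two-agent dataset, the single fixed observation, the linear velocity model fit by least squares, and the names `joint_flow_policy` and `one_step_policies`. The Q-learning and IGM terms are omitted entirely, so stage 2 here is behavior-cloning distillation only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy offline data (assumption): correlated joint actions of 2 agents,
# one scalar action each, for a single fixed joint observation.
n = 2048
base = rng.normal(size=(n, 1))
actions = np.hstack([base, 0.8 * base]) + 0.1 * rng.normal(size=(n, 2))


def features(x, t):
    """Inputs of a linear velocity model: [x, t, 1]."""
    return np.hstack([x, t, np.ones_like(t)])


# Stage 1: flow matching on joint actions.
# Interpolate between noise z and data a, then regress the velocity (a - z).
z = rng.normal(size=actions.shape)
t = rng.uniform(size=(n, 1))
x_t = (1.0 - t) * z + t * actions
target_v = actions - z
W, *_ = np.linalg.lstsq(features(x_t, t), target_v, rcond=None)


def joint_flow_policy(noise, steps=10):
    """Multi-step teacher: Euler integration of the learned velocity field."""
    x = noise.copy()
    dt = 1.0 / steps
    for k in range(steps):
        t_k = np.full((x.shape[0], 1), k * dt)
        x = x + dt * features(x, t_k) @ W
    return x


# Stage 2: distill into decentralized one-step policies.
# Student i sees only its own noise z_i and regresses the teacher's i-th action.
z_new = rng.normal(size=(n, 2))
teacher = joint_flow_policy(z_new)
students = []
for i in range(2):
    phi = np.hstack([z_new[:, [i]], np.ones((n, 1))])
    w_i, *_ = np.linalg.lstsq(phi, teacher[:, i], rcond=None)
    students.append(w_i)


def one_step_policies(noise):
    """One forward pass per agent, no iterative integration."""
    cols = [students[i][0] * noise[:, i] + students[i][1] for i in range(2)]
    return np.stack(cols, axis=1)


# Mean squared gap between the joint teacher and the factorized students.
distill_mse = float(
    np.mean(np.sum((teacher - one_step_policies(z_new)) ** 2, axis=1))
)
print("distillation MSE:", distill_mse)
```

The teacher needs `steps` evaluations of the velocity model per action, while each student needs one, which is the source of the inference-time gap the abstract describes. Because each student conditions only on its own noise, some residual `distill_mse` is expected whenever the teacher's output for one agent depends on the other agent's noise.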

Paper Structure

This paper contains 33 sections, 3 theorems, 17 equations, 12 figures, 3 tables, 2 algorithms.

Key Result

Proposition 4.2

Fix a joint observation $\mathbf{o}$. Let $\mathbf{z}\sim p_0$ be a noise variable, and define the joint policy mapping $\mu_\phi(\mathbf{o},\mathbf{z})\in\mathcal{A}$ and the factorized policy mapping $\mu_{\mathbf{w}}(\mathbf{o},\mathbf{z})=[\mu_{w_1}(o_1,z_1),\dots,\mu_{w_I}(o_I,z_I)]\in\mathcal{

Figures (12)

  • Figure 1: Summary of results. Performance vs. inference speed for selected algorithms on the widely used MARL benchmarks SMACv1 and SMACv2. We plot aggregate mean performance and inference time across $18$ datasets for $8$ scenarios related to the SMAC maps. Inference time is measured as the total computation performed by each algorithm and reported in milliseconds (ms) on a $\log$ scale, where a higher value indicates greater computational cost. MAC-Flow achieves $\times14.5$ faster inference on average, with performance comparable to the previous SOTA.
  • Figure 2: Overview of the proposed solution. MAC-Flow consists of two stages. The first stage models the joint action distribution via flow matching to capture inter-agent dependencies, which extracts coordination behaviors more effectively than modeling each agent's policy in isolation. Individual critics are then trained under the individual-global-max (IGM) principle, thereby embedding behaviors for multi-agent coordination. The second stage emphasizes practicality by deriving individual policies for decentralized execution from the flow-based joint policy via Q maximization and BC distillation.
  • Figure 3: Inference time. These results are averaged over each benchmark's scenarios.
  • Figure 4: Offline-to-online experiments. Online fine-tuning starts at $0.5$ normalized steps.
  • Figure 5: Ablation study for RQ4. We test the effect of the distillation phase and its RL objective.
  • ...and 7 more figures

Theorems & Definitions (4)

  • Definition 4.1: Action distribution identical matching
  • Proposition 4.2: 2-Wasserstein upper bound of distillation
  • Proposition 4.3: Lipschitz value gap bound
  • Lemma D.1: Comparability of Joint and Factorized Policies
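
As background for Proposition 4.2, a standard coupling argument gives a bound of the following kind. This is a sketch under stated assumptions, not the paper's statement, which is truncated above and may differ in form or constants. The notation $\pi_\phi(\cdot\mid\mathbf{o})$ and $\pi_{\mathbf{w}}(\cdot\mid\mathbf{o})$ is introduced here for the laws of $\mu_\phi(\mathbf{o},\mathbf{z})$ and $\mu_{\mathbf{w}}(\mathbf{o},\mathbf{z})$ under $\mathbf{z}\sim p_0$. Because both mappings are driven by the same noise $\mathbf{z}$, the pair $(\mu_\phi(\mathbf{o},\mathbf{z}),\mu_{\mathbf{w}}(\mathbf{o},\mathbf{z}))$ is one valid coupling of the two laws, and the 2-Wasserstein distance is an infimum over all couplings, so

$$
W_2\big(\pi_\phi(\cdot\mid\mathbf{o}),\,\pi_{\mathbf{w}}(\cdot\mid\mathbf{o})\big)^2
\;\le\;
\mathbb{E}_{\mathbf{z}\sim p_0}\Big[\big\lVert \mu_\phi(\mathbf{o},\mathbf{z})-\mu_{\mathbf{w}}(\mathbf{o},\mathbf{z})\big\rVert_2^2\Big].
$$

The right-hand side is the mean squared gap between the joint and factorized mappings, so reducing that gap tightens the bound on the distance between the two policies.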