B2MAPO: A Batch-by-Batch Multi-Agent Policy Optimization to Balance Performance and Efficiency

Wenjing Zhang, Wei Zhang, Wenqing Hu, Yifan Wang

TL;DR

B2MAPO introduces Batch-by-Batch Multi-Agent Policy Optimization to balance performance and efficiency in cooperative MARL by partitioning agent policies into update batches and updating them sequentially with off-policy corrections. It provides a monotonic-improvement bound for joint and individual policies and presents a two-layer CTDE-compatible framework: a batch-sequence generation layer and a batch-by-batch optimization layer. A concrete DAG-based implementation yields a DAG generator and DAG critic that produce optimal batch sequences and accurate advantages, while a derived independent policy $\boldsymbol{\pi}_{ind}$ is trained via MAPPO to maintain CTDE efficiency. Empirical results on SMAC and GRF show that B2MAPO achieves superior performance with substantial reductions in both training and execution time compared with strong baselines, demonstrating practical scalability and applicability to large cooperative agent systems.
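The idea of updating batches sequentially with off-policy corrections can be illustrated with a small sketch. This summary does not give the exact form of the correction, so the code below assumes a common choice from sequential-update methods: when a batch is updated, each sample is weighted by the product of the probability ratios $\hat{\pi}^i(a^i|s)/\pi^i(a^i|s)$ of the agents in batches that were already updated. The function name `correction_weights` and the example numbers are hypothetical.

```python
from math import prod

def correction_weights(batches, ratios):
    """Return one weight per batch, in update order.

    batches: list of lists of agent ids, in the order they are updated.
    ratios:  dict agent id -> new/old action-probability ratio for one sample.
    Assumption (not taken from the paper): the weight for a batch is the
    product of the ratios of all agents updated in earlier batches.
    """
    weights = []
    already_updated = []
    for batch in batches:
        # Empty product is 1, so the first batch is uncorrected.
        weights.append(prod(ratios[i] for i in already_updated))
        already_updated.extend(batch)
    return weights

# Hypothetical example: four agents split into three batches.
batches = [[0, 2], [1], [3]]
ratios = {0: 1.1, 1: 0.9, 2: 1.0, 3: 1.2}
weights = correction_weights(batches, ratios)
# weights[0] is 1, weights[1] is about 1.1, weights[2] is about 0.99
```

Agents inside the same batch share one weight, which is why a batch can be updated in parallel while the batches themselves stay ordered.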

Abstract

Most multi-agent reinforcement learning approaches use one of two policy optimization schemes: updating the policies of all agents simultaneously, or updating them sequentially. Updating all policies simultaneously introduces a non-stationarity problem. Updating policies sequentially, agent by agent, in an appropriate order improves policy performance, but sequential execution makes it inefficient and lengthens both model training and execution time. Intuitively, partitioning the agents' policies according to their interdependence and updating the joint policy batch by batch can balance performance and efficiency. However, determining the optimal batch partition and the batch updating order is challenging. First, a sequential batched policy updating scheme, B2MAPO (Batch-by-Batch Multi-Agent Policy Optimization), is proposed with a theoretical guarantee of a monotonic, incrementally tightened bound. Second, a universal, modularized, plug-and-play B2MAPO hierarchical framework that satisfies the CTDE principle is designed so that any MARL model can be integrated conveniently, exploiting and merging the models' merits, including policy optimality and inference efficiency. Third, a DAG-based B2MAPO algorithm is devised as a carefully designed implementation of the B2MAPO framework. Comprehensive experiments on the StarCraft II Multi-Agent Challenge and Google Research Football demonstrate that the DAG-based B2MAPO algorithm outperforms baseline methods. Meanwhile, compared with A2PO, our algorithm reduces model training and execution time by 60.4% and 78.7%, respectively.
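To make the link between a dependency DAG and a batch sequence concrete, here is a minimal sketch. It assumes that an edge `(i, j)` means agent `j` depends on agent `i`, and that agents with no unresolved dependencies can share a batch, so the batches are the topological layers of the graph. In the paper the DAG is produced by a learned generator; the hand-written edge list and the function name `dag_to_batches` are hypothetical.

```python
def dag_to_batches(num_agents, edges):
    """Split agents into ordered batches using topological layers.

    edges: iterable of (parent, child) pairs; child is updated after parent.
    Raises ValueError if the graph has a cycle.
    """
    indegree = [0] * num_agents
    children = [[] for _ in range(num_agents)]
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # Agents without incoming edges form the first batch.
    frontier = [i for i in range(num_agents) if indegree[i] == 0]
    batches = []
    while frontier:
        batches.append(sorted(frontier))
        next_frontier = []
        for node in frontier:
            for child in children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_frontier.append(child)
        frontier = next_frontier

    if sum(len(b) for b in batches) != num_agents:
        raise ValueError("dependency graph contains a cycle")
    return batches

# Hypothetical dependencies among five agents.
print(dag_to_batches(5, [(0, 2), (1, 2), (2, 3), (1, 4)]))
# prints [[0, 1], [2, 4], [3]]
```

With no edges, all agents fall into one batch (a simultaneous update); with a chain of edges, every batch holds one agent (an agent-by-agent update). Batch-by-batch updating sits between these two extremes.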

Paper Structure

This paper contains 18 sections, 3 theorems, 33 equations, 5 figures, and 4 tables.

Key Result

Theorem 1

Given $(B,\prec)$, let $\alpha^{b_k}=D_{TV}^{\max}(\pi^{b_k}\|\hat{\pi}^{b_k})$, $\epsilon=\max_{b_k}\epsilon^{b_k}$, $\epsilon^{b_k}=\max_{s,\boldsymbol{a}}|A^{\hat{\boldsymbol{\pi}}^{b_{k-1}}}(s,\boldsymbol{a})|$, and $\xi^{b_k}=\max_{s,\boldsymbol{a}}|A^{\boldsymbol{\pi},\hat{\boldsymbol{\pi}}\ldots}\ldots$
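The quantity $\alpha^{b_k}$ is easy to compute for small discrete policies. The sketch below assumes the standard trust-region definition, $D_{TV}^{\max}(\pi\|\hat{\pi})=\max_s \tfrac{1}{2}\sum_a|\pi(a|s)-\hat{\pi}(a|s)|$, and represents a batch policy as a table of action distributions per state. The state names and probabilities are hypothetical.

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def max_tv(policy, new_policy):
    """Maximum over states of the TV distance between two tabular policies.

    Both arguments map a state to a list of action probabilities.
    """
    return max(tv_distance(policy[s], new_policy[s]) for s in policy)

# Hypothetical old and updated policies for one batch, two states, two actions.
old = {"s0": [0.5, 0.5], "s1": [0.9, 0.1]}
new = {"s0": [0.6, 0.4], "s1": [0.7, 0.3]}
alpha = max_tv(old, new)  # about 0.2, reached in state "s1"
```

A small $\alpha^{b_k}$ means the batch update moved the policy only slightly in every state, which is the regime where improvement bounds of this kind are tight.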

Figures (5)

  • Figure 1: The taxonomy of different policy rollout and update schemes. The joint policy at the tail of a thick red arrow is used for rollout, and the sampled data is used to update the joint policy at the arrow's head. A thin black arrow indicates that the old joint policy at its tail is updated to become the new joint policy at its head. For batch-by-batch policy optimization, $\boldsymbol{\pi}^{b_k}$ denotes the joint policy formed by the policies of the agents in batch $b_k$, and $\hat{\boldsymbol{\pi}}^{b_k}$ denotes the joint policy after batch $b_k$ has been updated.
  • Figure 2: The structure of the B2MAPO framework.
  • Figure 3: The network architecture of the DAG-based B2MAPO algorithm.
  • Figure 4: Performance comparison between B2MAPO and the baselines on SMAC.
  • Figure 5: Performance comparison between B2MAPO and the baselines on GRF.

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • Theorem 3