B2MAPO: A Batch-by-Batch Multi-Agent Policy Optimization to Balance Performance and Efficiency
Wenjing Zhang, Wei Zhang, Wenqing Hu, Yifan Wang
TL;DR
B2MAPO introduces Batch-by-Batch Multi-Agent Policy Optimization to balance performance and efficiency in cooperative MARL by partitioning agent policies into update batches and updating them sequentially with off-policy corrections. It provides a monotonic-improvement bound for joint and individual policies and presents a two-layer CTDE-compatible framework: a batch-sequence generation layer and a batch-by-batch optimization layer. A concrete DAG-based implementation yields a DAG generator and DAG critic that produce optimal batch sequences and accurate advantages, while a derived independent policy $\boldsymbol{\pi}_{ind}$ is trained via MAPPO to maintain CTDE efficiency. Empirical results on SMAC and GRF show that B2MAPO achieves superior performance with substantial reductions in both training and execution time compared with strong baselines, demonstrating practical scalability and applicability to large cooperative agent systems.
Abstract
Most multi-agent reinforcement learning (MARL) approaches adopt one of two types of policy optimization: they update the policies of all agents either simultaneously or sequentially. Updating all agents' policies simultaneously introduces a non-stationarity problem. Updating policies sequentially, agent by agent in an appropriate order, improves policy performance, but sequential execution makes it inefficient and lengthens both model training and execution time. Intuitively, partitioning the agents' policies according to their interdependence and updating the joint policy batch by batch can effectively balance performance and efficiency. However, determining the optimal batch partition of policies and the batch updating order is challenging. First, a sequential batched policy updating scheme, B2MAPO (Batch-by-Batch Multi-Agent Policy Optimization), is proposed with a theoretical guarantee of a monotonic, incrementally tightened bound. Second, a universal, modularized, plug-and-play B2MAPO hierarchical framework, which satisfies the CTDE (centralized training with decentralized execution) principle, is designed to conveniently integrate any MARL models and to exploit and merge their merits, including policy optimality and inference efficiency. Third, a DAG-based B2MAPO algorithm is devised as a carefully designed implementation of the B2MAPO framework. Comprehensive experiments on the StarCraft II Multi-Agent Challenge (SMAC) and Google Research Football (GRF) demonstrate that the DAG-based B2MAPO algorithm outperforms baseline methods. Moreover, compared with A2PO, our algorithm reduces model training time and execution time by 60.4% and 78.7%, respectively.
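To make the batch-by-batch idea concrete, the following is a minimal sketch of how a batch sequence could be read off a dependency DAG over agents. It is not the paper's DAG generator, which is learned; here the DAG is assumed to be given, an edge `(i, j)` is assumed to mean that agent `j` should be updated after agent `i`, and the function name `dag_to_batches` is hypothetical. Agents whose dependencies are all satisfied are grouped into the same batch (Kahn-style topological layering), so each batch can be updated in parallel while the batches themselves are processed in order.

```python
from collections import defaultdict


def dag_to_batches(num_agents, edges):
    """Group agents into ordered update batches from a dependency DAG.

    Assumption (illustrative, not from the paper): an edge (i, j) means
    agent j's update must come after agent i's update.
    """
    children = defaultdict(list)
    indegree = [0] * num_agents
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # The first batch holds every agent with no unmet dependency.
    batch = [a for a in range(num_agents) if indegree[a] == 0]
    batches = []
    visited = 0
    while batch:
        batches.append(batch)
        visited += len(batch)
        next_batch = []
        for agent in batch:
            for child in children[agent]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_batch.append(child)
        batch = sorted(next_batch)

    if visited != num_agents:
        raise ValueError("dependency graph contains a cycle")
    return batches


# Agents 0 and 1 are independent, agent 2 depends on both,
# and agents 3 and 4 depend on agent 2.
print(dag_to_batches(5, [(0, 2), (1, 2), (2, 3), (2, 4)]))
# → [[0, 1], [2], [3, 4]]
```

Under these assumptions the two schemes described in the abstract appear as the extremes: an empty edge set yields a single batch (fully simultaneous updating), and a chain of edges yields one agent per batch (fully sequential updating).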
