POGEMA: A Benchmark Platform for Cooperative Multi-Agent Pathfinding
Alexey Skrynnik, Anton Andreychuk, Anatolii Borzilov, Alexander Chernyavskiy, Konstantin Yakovlev, Aleksandr Panov
TL;DR
POGEMA addresses the lack of a unified benchmark for cooperative multi-agent pathfinding (MAPF) by providing a fast, Python-based environment, a problem-instance generator, a visualization toolkit, and a benchmarking suite with a domain-specific evaluation protocol. It enables fair comparisons across pure multi-agent reinforcement learning (MARL), hybrid, and planning-based methods, and it ships a diverse set of baselines, including centralized planners such as LaCAM and RHCR and the state-of-the-art hybrids SCRIMP and DCC. Across MAPF and Lifelong MAPF, centralized planners frequently lead on key metrics, and hybrids hold a strong edge over pure MARL in many scenarios; under Lifelong MAPF, MARL approaches can match or exceed some baselines, illustrating the value of shared information and planning components. The platform’s procedural map generation, detailed metrics, and distributed evaluation capabilities offer a practical means to study generalization, scalability, and coordination, with potential impact on real-world robotics and warehouse automation research.
Abstract
Multi-agent reinforcement learning (MARL) has recently excelled in solving challenging cooperative and competitive multi-agent problems in various environments, typically involving a small number of agents and full observability. Moreover, a range of crucial robotics-related tasks, such as multi-robot pathfinding, which have traditionally been approached with classical non-learnable methods (e.g., heuristic search), are now being tackled with learning-based or hybrid methods. However, in this domain, it remains difficult, if not impossible, to conduct a fair comparison between classical, learning-based, and hybrid approaches due to the lack of a unified framework that supports both learning and evaluation. To address this, we introduce POGEMA, a comprehensive set of tools that includes a fast environment for learning, a problem instance generator, a collection of predefined problem instances, a visualization toolkit, and a benchmarking tool for automated evaluation. We also define an evaluation protocol that specifies a range of domain-related metrics, computed from primary evaluation indicators (such as success rate and path length), enabling a fair multi-fold comparison. We present the results of this comparison, which involves a variety of state-of-the-art MARL, search-based, and hybrid methods.
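To make the idea of metrics "computed from primary evaluation indicators" concrete, the following minimal sketch aggregates per-episode outcomes into a success rate and an average path length. It is an illustration only and does not use POGEMA's API: the `EpisodeResult` record, its fields, and the `summarize` function are hypothetical names introduced here, and averaging path length over solved episodes only is an assumption rather than the protocol's stated rule.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class EpisodeResult:
    """Outcome of one evaluation episode (hypothetical record, not POGEMA's API)."""
    success: bool            # True if every agent reached its goal in time
    path_lengths: List[int]  # number of steps taken by each agent


def summarize(results: List[EpisodeResult]) -> dict:
    """Aggregate primary indicators over a batch of episodes."""
    if not results:
        raise ValueError("need at least one episode result")

    # Fraction of episodes in which all agents reached their goals.
    success_rate = sum(r.success for r in results) / len(results)

    # Assumption: path length is averaged over agents in solved episodes only,
    # so that failed episodes do not distort the figure.
    solved_lengths = [n for r in results if r.success for n in r.path_lengths]
    avg_path_length: Optional[float] = (
        mean(solved_lengths) if solved_lengths else None
    )

    return {"success_rate": success_rate, "avg_path_length": avg_path_length}


if __name__ == "__main__":
    demo = [
        EpisodeResult(success=True, path_lengths=[3, 5]),
        EpisodeResult(success=False, path_lengths=[10, 10]),
        EpisodeResult(success=True, path_lengths=[4, 4]),
    ]
    print(summarize(demo))
```

Reporting both numbers together matters because a method can trade one for the other: a cautious policy may solve few instances with short paths, while an aggressive one may solve more instances with longer paths.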
