A Quality Diversity Method to Automatically Generate Multi-Agent Path Finding Benchmark Maps
Cheng Qian, Yulun Zhang, Varun Bhatt, Matthew Christopher Fontaine, Stefanos Nikolaidis, Jiaoyang Li
TL;DR
The paper tackles biased and incomplete MAPF benchmarking by introducing a framework that uses Quality Diversity (QD) with Neural Cellular Automata (NCA) to automatically generate diverse, high-quality benchmark maps. It adapts CMA-MAE to optimize NCAs for map generation, repairs maps to validity, and evaluates them across five MAPF algorithms (CBS, EECBS, PBS, PIBT, LTF) to produce both hard maps for individual algorithms and challenging maps that widen performance gaps between algorithm pairs. The authors provide concrete setups (e.g., map size $32 \times 32$, obstacles in $[307,717]$, $N_e=5$, $N_{eval}=10{,}000$) and report distinct patterns where each algorithm excels or struggles, along with validation on additional instances. This work advances fair, comprehensive benchmarking in MAPF and offers guidelines for using the framework to stress-test or compare algorithms, albeit with substantial computational cost. A public release of representative benchmark maps is planned to facilitate broader adoption.
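As a rough illustration of what checking a generated map against the stated setup could look like, the sketch below tests a $32 \times 32$ grid for an obstacle count in $[307, 717]$ (roughly 30% to 70% of the 1,024 cells) and for a single connected free region. The connectivity criterion, the function name `is_valid_map`, and the boolean-grid encoding are assumptions made for this example; the summary above states only that maps are repaired to validity, not how validity is defined.

```python
from collections import deque

SIZE = 32                                # map side length from the stated setup
MIN_OBSTACLES, MAX_OBSTACLES = 307, 717  # obstacle-count range from the stated setup


def is_valid_map(grid):
    """Return True if `grid` satisfies the assumed validity criteria.

    `grid` is a list of SIZE rows, each a list of SIZE booleans
    (True = obstacle). Connectivity of the free cells is an assumed
    criterion, not one confirmed by the source.
    """
    if len(grid) != SIZE or any(len(row) != SIZE for row in grid):
        return False

    n_obstacles = sum(1 for row in grid for cell in row if cell)
    if not MIN_OBSTACLES <= n_obstacles <= MAX_OBSTACLES:
        return False

    free = [(r, c) for r in range(SIZE) for c in range(SIZE) if not grid[r][c]]

    # Breadth-first search over 4-connected free cells, starting from any free cell.
    start = free[0]
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < SIZE and 0 <= nc < SIZE
                    and not grid[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))

    # Valid only if every free cell was reached from the start cell.
    return len(seen) == len(free)
```

A repair step would then modify a candidate map until a check of this kind passes; the repair procedure itself is not described in this summary and is not sketched here.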
Abstract
We use the Quality Diversity (QD) algorithm with Neural Cellular Automata (NCA) to generate benchmark maps for Multi-Agent Path Finding (MAPF) algorithms. Previously, MAPF algorithms have been tested on fixed, human-designed benchmark maps. However, such fixed benchmark maps have several problems. First, they may not cover all the potential failure scenarios of the algorithms. Second, when comparing different algorithms, fixed benchmark maps may introduce bias, leading to unfair comparisons. Third, since researchers test new algorithms on a small set of fixed benchmark maps, the design of the algorithms may overfit to that set. In this work, we take advantage of the QD algorithm to (1) generate maps with patterns that support a comprehensive understanding of the performance of MAPF algorithms, and (2) make fair comparisons between two MAPF algorithms, providing further information on the choice between them and on algorithm design. Empirically, we employ this technique to generate diverse benchmark maps to evaluate and compare the behavior of different types of MAPF algorithms, including search-based, priority-based, rule-based, and learning-based algorithms. Through both single-algorithm experiments and comparisons between algorithms, we identify patterns where each algorithm excels and detect disparities in runtime or success rates between different algorithms.
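To make the QD idea concrete, the following minimal sketch shows a MAP-Elites-style grid archive: the measure space is divided into cells, and each cell keeps only the best-scoring solution seen so far. This is a simpler stand-in for illustration, not the CMA-MAE variant that the paper adapts. The class name `GridArchive`, the single measure (obstacle count), and the objective values are hypothetical.

```python
class GridArchive:
    """Minimal MAP-Elites-style archive (illustrative stand-in, not CMA-MAE)."""

    def __init__(self, bins, lows, highs):
        self.bins = tuple(bins)    # number of cells along each measure dimension
        self.lows = tuple(lows)    # lower bound of each measure
        self.highs = tuple(highs)  # upper bound of each measure
        self.cells = {}            # cell index -> (objective, solution)

    def index(self, measures):
        """Map a measure vector to a cell index, clamping to the archive bounds."""
        idx = []
        for m, lo, hi, b in zip(measures, self.lows, self.highs, self.bins):
            i = int((m - lo) / (hi - lo) * b)
            idx.append(min(max(i, 0), b - 1))
        return tuple(idx)

    def add(self, solution, objective, measures):
        """Insert if the cell is empty or the new objective is strictly higher."""
        key = self.index(measures)
        current = self.cells.get(key)
        if current is None or objective > current[0]:
            self.cells[key] = (objective, solution)
            return True
        return False


# Hypothetical usage: one measure (obstacle count in [307, 717]) split into 10 cells.
archive = GridArchive(bins=(10,), lows=(307,), highs=(717,))
archive.add("map_a", objective=0.4, measures=(350,))  # fills an empty cell
archive.add("map_b", objective=0.2, measures=(355,))  # same cell, worse: rejected
archive.add("map_c", objective=0.9, measures=(600,))  # different cell: accepted
```

Keeping one elite per cell is what produces a set of maps that is both diverse (spread across the measure space) and high-quality (each the best found in its cell).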
