Confidence-Based Curriculum Learning for Multi-Agent Path Finding

Thomy Phan; Joseph Driscoll; Justin Romberg; Sven Koenig

TL;DR

This work narrows the gap between multi-agent path finding (MAPF) and multi-agent reinforcement learning (MARL) by introducing CACTUS, a lightweight confidence-based reverse curriculum. CACTUS places each agent's goal within an allocation radius $R_{\textit{alloc}}$ of its start location and increases the radius when the average completion rate $\mu$ meets a threshold $U$ with sufficient confidence, i.e., when $\mu - \eta \sigma \geq U$. By formulating MAPF as a stochastic game and combining centralized training with decentralized execution (CTDE) with this simple reverse curriculum, CACTUS achieves competitive performance using under $6 \times 10^{5}$ trainable parameters and CPU-based training. Empirical results across maps of varying sizes and obstacle densities show that CACTUS, particularly with QMIX or QPLEX critics, outperforms prior MARL approaches in learning efficiency and generalization, though centralized solvers like MAPF-LNS and CBSH still lead on some harder or highly structured layouts. The findings indicate that a simple, well-defined curriculum can enable effective MARL for MAPF without heavy reward shaping or imitation data, offering a scalable foundation for future integration of MARL and MAPF.

Abstract

A wide range of real-world applications can be formulated as a Multi-Agent Path Finding (MAPF) problem, where the goal is to find collision-free paths for multiple agents with individual start and goal locations. State-of-the-art MAPF solvers are mainly centralized and depend on global information, which limits their scalability and their flexibility with respect to changes or new maps that would require expensive replanning. Multi-agent reinforcement learning (MARL) offers an alternative by learning decentralized policies that can generalize over a variety of maps. While some prior works attempt to connect both areas, the proposed techniques are heavily engineered and very complex: they integrate many mechanisms that limit generality and are expensive to use. We argue that much simpler and more general approaches are needed to bring the areas of MARL and MAPF closer together at significantly lower cost. In this paper, we propose Confidence-based Auto-Curriculum for Team Update Stability (CACTUS) as a lightweight MARL approach to MAPF. CACTUS defines a simple reverse curriculum scheme, where the goal of each agent is randomly placed within an allocation radius around the agent's start location. The allocation radius increases gradually as all agents improve, which is assessed by a confidence-based measure. We evaluate CACTUS in various maps of different sizes, obstacle densities, and numbers of agents. Our experiments demonstrate better performance and generalization capabilities than state-of-the-art MARL approaches with less than 600,000 trainable parameters, which is less than 5% of the neural network size of current MARL approaches to MAPF.
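The goal placement described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a grid map given as a list of rows with 0 for free cells and 1 for obstacles, a square neighborhood around the start (suggested by the shaded squares in Figure 1), and that goals exclude the start cell and cells already assigned to other agents. The function name `sample_goal` and its parameters are hypothetical.

```python
import random


def sample_goal(start, radius, grid, taken=(), rng=random):
    """Pick a random free cell within `radius` of `start`.

    Assumptions (not from the paper): grid[r][c] == 0 marks a free cell,
    the neighborhood is a square (Chebyshev distance <= radius), and the
    goal may not coincide with the start or with a cell in `taken`.
    """
    rows, cols = len(grid), len(grid[0])
    r0, c0 = start
    candidates = [
        (r, c)
        for r in range(max(0, r0 - radius), min(rows, r0 + radius + 1))
        for c in range(max(0, c0 - radius), min(cols, c0 + radius + 1))
        if grid[r][c] == 0 and (r, c) != start and (r, c) not in taken
    ]
    # Fall back to the start cell if no other free cell is in range.
    return rng.choice(candidates) if candidates else start


# Usage: a 5x5 map with one obstacle and a small allocation radius.
grid = [[0] * 5 for _ in range(5)]
grid[1][1] = 1
goal = sample_goal((2, 2), radius=1, grid=grid, rng=random.Random(0))
```

A small radius keeps goals close to the start, so early episodes are short and easy. Growing the radius later exposes the agents to longer paths. The sketch does not check whether the sampled goal is actually reachable.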

Paper Structure (28 sections, 7 equations, 8 figures, 1 table, 1 algorithm)

Introduction
Background
Multi-Agent Path Finding
Multi-Agent Reinforcement Learning
Policy Gradient MARL
Centralized Training Decentralized Execution (CTDE)
Curriculum Learning
Related Work
Reverse Curriculum Generation
Curriculum Learning in MARL
MARL for MAPF
MAPF as a Stochastic Game
Confidence-based Curriculum
Training Scheme
Reverse Curriculum Scheme
...and 13 more sections

Figures (8)

Figure 1: Curriculum update scheme of CACTUS. The agents (colored circles) are trained and evaluated w.r.t. a goal allocation radius $R_{\textit{alloc}}$ (shaded squares around the agents). When the average completion rate $\mu$ exceeds the decision threshold $U$ with a certain confidence level such that $\mu - \eta \sigma \geq U$, the allocation radius $R_{\textit{alloc}}$ is incremented by 1.
Figure 2: Example of an individual observation of the red agent in a gridworld domain. Agents are represented as colored circles, their goals as similarly colored squares, and obstacles as black squares. Each agent $i$ has a limited field of view (FOV) of the environment map, which is centered on its location and encoded by five channels: the locations of obstacles, the locations of other agents' goals, the locations of nearby agents, the location of its own goal $v_{\textit{goal},i}$ if within the FOV, and the Manhattan distance and direction of agent $i$ to its goal.
Figure 3: Common actor-critic scheme as used in various prior work on cooperative MARL [su2021value; phan2021resilient]. A separate critic is trained for each actor using some centralized factorization operator $\Psi$ like QMIX or QPLEX [phan2021vast].
Figure 4: Left: Comparison of the number of trainable parameters in PRIMAL and CACTUS. Note the logarithmic scale on the y-axis. Right: The schematic network architectures used for PRIMAL and CACTUS. The sizes do not reflect any quantity and only illustrate the components used for learning.
Figure 5: Average training progress of CACTUS variants, PRIMAL, and a naive MARL baseline without any curriculum w.r.t. training epochs (left) and training time (right). The performance is evaluated on all pre-generated test instances $I$ of [sartoretti2019primal] with $K \in \{10, 40, 80\}$, $\delta \in \{0, 0.1, 0.2, 0.3\}$, and $N = 8$ agents. Shaded areas show the 95% confidence interval.
...and 3 more figures
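
The curriculum update in the Figure 1 caption (increment $R_{\textit{alloc}}$ by 1 when $\mu - \eta \sigma \geq U$) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' code. It assumes $\mu$ and $\sigma$ are the mean and (population) standard deviation of completion rates over a batch of recent evaluation episodes. The function name `update_radius`, the optional cap `max_radius`, and the example values of $U$ and $\eta$ are hypothetical.

```python
from statistics import mean, pstdev


def update_radius(completion_rates, radius, threshold, eta, max_radius=None):
    """Return the allocation radius after one confidence-based check.

    Assumptions (not from the paper): `completion_rates` holds the
    completion rate of each recent evaluation episode, and sigma is their
    population standard deviation. `max_radius` is a hypothetical cap.
    """
    mu = mean(completion_rates)
    sigma = pstdev(completion_rates)
    # Advance only if the pessimistic estimate still clears the threshold.
    if mu - eta * sigma >= threshold:
        radius += 1
    if max_radius is not None:
        radius = min(radius, max_radius)
    return radius


# Usage with hypothetical values U = 0.9 and eta = 2.0.
steady = update_radius([1.0, 1.0, 1.0], radius=1, threshold=0.9, eta=2.0)
noisy = update_radius([1.0, 0.5, 1.0, 0.6], radius=1, threshold=0.9, eta=2.0)
```

Subtracting $\eta \sigma$ from the mean makes the check conservative: a high but unstable completion rate does not advance the curriculum, while a consistently high one does.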
