Improving Learnt Local MAPF Policies with Heuristic Search

Rishi Veerapaneni; Qian Wang; Kevin Ren; Arthur Jakobsson; Jiaoyang Li; Maxim Likhachev

TL;DR

This work addresses the poor scalability of learned local MAPF policies by combining them with heuristic search. It introduces CS-PIBT, a model-agnostic collision shield, and uses LaCAM to enable full-horizon planning with theoretical completeness. Together these substantially improve success rates and scalability, even at high congestion (agent density of $0.2$). Empirically, CS-PIBT markedly outperforms naive collision shielding, and LaCAM can use learned policies to scale to hundreds of agents under timeouts, although performance is sensitive to heuristic quality. The study also analyzes how randomness in action ordering affects outcomes and identifies when learned policies are advantageous, highlighting both the continued strength of classical heuristics and the potential of learned components in more complex, high-dimensional MAPF scenarios.
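
To make the contrast with naive shielding concrete, the following minimal Python sketch shows one way such a shield could work: any move involved in a vertex or swap collision is replaced by a wait, and the check repeats until the joint step is collision-free. The function name `naive_collision_shield`, the dictionary-based state, and the grid coordinates are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def naive_collision_shield(current: Dict[str, Cell],
                           proposed: Dict[str, Cell]) -> Dict[str, Cell]:
    """Replace colliding moves with waits until the joint step is collision-free.

    `current` maps each agent to its cell now; `proposed` maps each agent to the
    cell picked by its policy. Only the picked actions are used, never the
    full action distribution.
    """
    nxt = dict(proposed)
    changed = True
    while changed:
        changed = False
        # Vertex collisions: several agents end up in the same cell.
        occupants: Dict[Cell, List[str]] = {}
        for agent, cell in nxt.items():
            occupants.setdefault(cell, []).append(agent)
        for agents in occupants.values():
            if len(agents) > 1:
                for agent in agents:
                    if nxt[agent] != current[agent]:
                        nxt[agent] = current[agent]  # moving agent waits instead
                        changed = True
        # Edge collisions: two agents swap cells.
        for a in nxt:
            for b in nxt:
                if a < b and nxt[a] == current[b] and nxt[b] == current[a]:
                    nxt[a], nxt[b] = current[a], current[b]
                    changed = True
    return nxt


# Two agents that both want the middle cell: the shield makes both wait.
current = {"blue": (0, 0), "green": (0, 2)}
proposed = {"blue": (0, 1), "green": (0, 1)}
shielded = naive_collision_shield(current, proposed)
```

In this example both agents are sent back to their current cells. If the policy keeps proposing the same contested cell, the same thing happens on every step, which is the deadlock pattern that motivates a stronger shield.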

Abstract

Multi-agent path finding (MAPF) is the problem of finding collision-free paths for a team of agents to reach their goal locations. State-of-the-art classical MAPF solvers typically employ heuristic search to find solutions for hundreds of agents, but they are usually centralized and can struggle to scale when run with short timeouts. Machine learning (ML) approaches that learn policies for each agent are appealing because they could enable decentralized systems and scale well while maintaining good solution quality. Current ML approaches to MAPF have only started to scratch the surface of this potential: state-of-the-art ML approaches produce "local" policies that plan for only a single timestep and have poor success rates and scalability. Our main idea is that we can improve an ML local policy by using heuristic search methods on the output probability distribution to resolve deadlocks and enable full horizon planning. We show several model-agnostic ways to use heuristic search with learnt policies that significantly improve the policies' success rates and scalability. To the best of our knowledge, this is the first time ML-based MAPF approaches have been shown to scale to high congestion scenarios (e.g., 20% agent density).
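
The main idea, running a PIBT-style search over each agent's output action distribution, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the fixed `priority` list, the five-action encoding in `MOVES`, and the names `cs_pibt_step` and `biased_order` are all hypothetical. Actions are tried in order of policy preference (or in a probability-weighted random order when `rng` is given), and an agent that wants an occupied cell asks the occupant to plan first (priority inheritance).

```python
import random
from typing import Dict, List, Optional, Set, Tuple

Cell = Tuple[int, int]
# Assumed action encoding: wait, right, left, down, up.
MOVES = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]


def biased_order(probs: List[float], rng: random.Random) -> List[int]:
    """Sample an action ordering without replacement, weighted by probability."""
    remaining = list(range(len(probs)))
    order = []
    while remaining:
        # Small epsilon keeps zero-probability actions selectable as a last resort.
        weights = [probs[i] + 1e-9 for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        order.append(pick)
        remaining.remove(pick)
    return order


def cs_pibt_step(current: Dict[str, Cell],
                 policy: Dict[str, List[float]],
                 priority: List[str],
                 free: Set[Cell],
                 rng: Optional[random.Random] = None) -> Dict[str, Cell]:
    """One collision-free joint step chosen from per-agent action distributions."""
    nxt: Dict[str, Cell] = {}
    occupant = {cell: agent for agent, cell in current.items()}

    def plan(agent: str, blocker: Optional[str]) -> bool:
        probs = policy[agent]
        if rng is None:
            order = sorted(range(len(MOVES)), key=lambda i: -probs[i])
        else:
            order = biased_order(probs, rng)
        r, c = current[agent]
        for i in order:
            target = (r + MOVES[i][0], c + MOVES[i][1])
            if target not in free or target in nxt.values():
                continue  # obstacle, off-map, or already reserved
            if blocker is not None and target == current[blocker]:
                continue  # would swap with the agent that pushed us
            nxt[agent] = target
            other = occupant.get(target)
            if other is not None and other != agent and other not in nxt:
                if not plan(other, agent):  # priority inheritance
                    continue
            return True
        nxt[agent] = current[agent]  # no valid action: stay put
        return False

    for agent in priority:
        if agent not in nxt:
            plan(agent, None)
    return nxt
```

Unlike a shield that only sees each agent's picked action, this sketch falls back to lower-probability actions when the preferred one is blocked, so one of two agents contesting a cell can still make progress.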

Paper Structure (27 sections, 1 equation, 6 figures, 1 table, 2 algorithms)

Introduction
Related Works
MAPF Problem Formulation
Heuristic Search Approaches
PIBT
LaCAM
Machine Learning Approaches
Improving Learnt Local Policies with Heuristic Search
Handling 1-step Collisions with PIBT
CS-NAIVE
CS-PIBT
Full Horizon Planning with LaCAM
Combining a Local Policy with a Heuristic
Experimental Results
Learnt Policies used for Evaluation
...and 12 more sections

Figures (6)

Figure 1: We compare running MAGAT [li2021magat] with its default collision shielding (blue) against running MAGAT with our PIBT collision shielding (orange). MAGAT is a learnt local policy that predicts a one-step action distribution per agent; because the predicted actions can collide, it requires collision shielding. PIBT [pibt] is a heuristic search technique for solving MAPF. Using the exact same learnt model with our PIBT-based collision shielding improves performance and scalability without any additional training or information.
Figure 2: Given a learnt local MAPF policy that returns 1-step action distributions (depicted as black arrows overlaid on colored agents, with larger magnitude depicting higher probability), we need to resolve collisions that would occur if the agents followed the proposed actions. We depict an example where blue and green would collide with each other. Existing work uses a "naive collision shield", which looks only at the agents' picked actions and replaces colliding actions with wait actions; this can cause deadlock between agents. We propose using the PIBT collision shield (CS-PIBT) to resolve 1-step collisions and reduce deadlock. Note that CS-PIBT uses the entire action distribution of each agent. To enable full horizon planning, we can use the LaCAM framework with the learnt policy plus CS-PIBT as the configuration generator, as defined in [okumura2022lacam]. LaCAM in essence conducts a DFS over the joint-configuration space, which lets it escape local minima by backtracking and improves success rates.
Figure 3: We plot the effect of using CS-PIBT with and without biased sampling. Sampling the action order significantly improves performance compared with always trying the highest-probability actions first.
Figure 4: Lower ($\downarrow$) is better. We evaluate different methods of combining MAGAT's local policy with the standard backward Dijkstra's heuristic used in LaCAM. Cost differences are measured with respect to regular LaCAM informed by $h_{BD}$ with random tie-breaking (blue). LaCAM2 (green) tie-breaks by preferring locations without other agents, which was shown to improve solution cost (but reduce success rate) in [pibt]. "Tie" (red) tie-breaks $h_{BD}$ using MAGAT's preferences. MAGAT (black) disregards $h_{BD}$. Intermediate "R" methods (purple, orange, cyan) sort using a weighted combination of $h_{BD}$ and MAGAT's probabilities. We see that tie-breaking with MAGAT improves solution cost over LaCAM, LaCAM2, and MAGAT.
Figure 5: We explore the effect of imperfect heuristics on PIBT and LaCAM. Given a setting of "$K\%$" imperfection and the perfect backward Dijkstra's heuristic $h_{BD}(s)$, we sample $e$ uniformly from $[1-K/100, 1]$ and obtain the imperfect heuristic $\bar{h}_{BD}(s) = e \times h_{BD}(s)$, which adds noise to the calculated heuristic values to simulate uncertainty. (a) shows the performance of PIBT and LaCAM in the $K=0$ (cyan), $5$ (brown), $10$ (green), and $20$ (red) $\bar{h}_{BD}$-imperfect settings. We additionally plot MAGAT, which does not use $h_{BD}$ and is thus independent of the heuristic imperfections. From the success rate, we see that PIBT completely fails starting at $K=10$ (hidden by the red triangle) and LaCAM fails at $K=20$. The solution cost in (b) reveals an even worse picture: even when LaCAM succeeds with $K=10$ or $20$, its solutions are extremely suboptimal, up to 100 times worse. (c) highlights that LaCAM can be extremely brittle, performing reasonably at $6\%$ but producing substantially worse solutions at $7\%$.
...and 1 more figure
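
The imperfect-heuristic setup described in the Figure 5 caption can be written out as a short Python sketch. It assumes a unit-cost 4-connected grid (so a breadth-first search from the goal yields $h_{BD}$) and assumes the scale factor $e$ is drawn independently for every state; the caption does not say how often $e$ is resampled. The function names are illustrative.

```python
import random
from collections import deque
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]


def backward_dijkstra(free: Set[Cell], goal: Cell) -> Dict[Cell, int]:
    """Exact distance-to-goal h_BD(s) on a unit-cost grid (BFS from the goal)."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0)):
            nb = (r + dr, c + dc)
            if nb in free and nb not in dist:
                dist[nb] = dist[(r, c)] + 1
                queue.append(nb)
    return dist


def imperfect_heuristic(h_bd: Dict[Cell, int],
                        k_percent: float,
                        rng: random.Random) -> Dict[Cell, float]:
    """h_bar(s) = e * h_BD(s) with e ~ Uniform[1 - K/100, 1].

    Assumption: e is sampled once per state.
    """
    low = 1.0 - k_percent / 100.0
    return {s: rng.uniform(low, 1.0) * h for s, h in h_bd.items()}


# 3x3 open grid with the goal in the top-left corner.
free = {(r, c) for r in range(3) for c in range(3)}
h = backward_dijkstra(free, goal=(0, 0))
h_bar = imperfect_heuristic(h, k_percent=20, rng=random.Random(0))
```

Because $e \le 1$, the noisy value never exceeds $h_{BD}(s)$, but the noise can change the relative ranking of neighbouring cells, and that ranking is what a search guided by the heuristic relies on.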
