LNS2+RL: Combining Multi-Agent Reinforcement Learning with Large Neighborhood Search in Multi-Agent Path Finding

Yutong Wang; Tanishq Duhan; Jiaoyang Li; Guillaume Sartoretti

TL;DR

The paper presents LNS2+RL, a MAPF algorithm that combines large neighborhood search with a MARL-based replanner to balance path quality and replanning speed. Early iterations use MARL to reduce collisions, then adaptively switch to a fast PP+SIPPS phase to finish planning efficiently, with SIPPS supplementation ensuring completeness. Across high-density maps and diverse map structures, LNS2+RL achieves higher success rates and competitive costs compared to baselines like LNS2, EECBS, SCRIMP, and LaCAM, and demonstrates practical viability in a hybrid warehouse setting using an Action Dependency Graph for execution. The work introduces a detailed MARL environment for PMDO-style replanning, a two-stage training regime with curriculum and imitation learning, and extensive real-world validation, highlighting scalability to thousands of agents and applicability to real-world warehouse robotics.

Abstract

Multi-Agent Path Finding (MAPF) is a critical component of logistics and warehouse management, which focuses on planning collision-free paths for a team of robots in a known environment. Recent work introduced a novel MAPF approach, LNS2, which repairs a quickly obtained set of infeasible paths via iterative replanning, relying on a fast, yet lower-quality, prioritized planning (PP) algorithm. At the same time, there has been a recent push for Multi-Agent Reinforcement Learning (MARL) based MAPF algorithms, which exhibit improved cooperation over such PP algorithms, although they inevitably remain slower. In this paper, we introduce a new MAPF algorithm, LNS2+RL, which combines the distinct yet complementary characteristics of LNS2 and MARL to effectively balance their individual limitations and get the best of both worlds. During early iterations, LNS2+RL relies on MARL for low-level replanning, which we show eliminates collisions far more effectively than a PP algorithm. There, our MARL-based planner allows agents to reason about past and future information to gradually learn cooperative decision-making through a finely designed curriculum learning scheme. At later stages of planning, LNS2+RL adaptively switches to a PP algorithm to quickly resolve the remaining collisions, naturally trading off solution quality (number of collisions in the solution) and computational efficiency. Our comprehensive experiments on high-agent-density tasks across various team sizes, world sizes, and map structures consistently demonstrate the superior performance of LNS2+RL compared to many MAPF algorithms, including LNS2, LaCAM, EECBS, and SCRIMP. In maps with complex structures, the advantages of LNS2+RL are particularly pronounced, with LNS2+RL achieving a success rate of over 50% in nearly half of the tested tasks, while that of LaCAM, EECBS, and SCRIMP falls to 0%.
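The repair loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`colliding_pairs`, `repair`), the path representation (one list of grid cells per agent, with agents waiting at their last cell), and in particular the switching rule (fall back to PP after `patience` consecutive MARL iterations without a strict reduction in colliding pairs) are assumptions made for illustration only. The abstract states only that the switch is adaptive. The two replanners and the neighborhood selector are passed in as plain callables, so any stand-in can be used.

```python
from itertools import combinations


def colliding_pairs(paths):
    """Count agent pairs that share a cell or swap cells at some timestep."""
    horizon = max(len(p) for p in paths)

    def at(path, t):
        # Assumption: an agent that has finished waits at its last cell.
        return path[min(t, len(path) - 1)]

    count = 0
    for a, b in combinations(range(len(paths)), 2):
        pa, pb = paths[a], paths[b]
        for t in range(horizon):
            vertex = at(pa, t) == at(pb, t)
            swap = (t > 0 and at(pa, t) == at(pb, t - 1)
                    and at(pa, t - 1) == at(pb, t))
            if vertex or swap:
                count += 1
                break  # each pair is counted at most once
    return count


def repair(paths, marl_replan, pp_replan, select_neighborhood,
           max_iters=100, patience=3):
    """Repair infeasible paths; start with MARL, fall back to PP when stalled.

    marl_replan / pp_replan: callables (paths, subset) -> new list of paths.
    select_neighborhood: callable (paths) -> list of agent indices to replan.
    patience: hypothetical switching threshold (not taken from the paper).
    """
    paths = list(paths)
    best = colliding_pairs(paths)
    use_marl, stalled = True, 0
    log = []  # (replanner name, colliding pairs of the candidate)
    for _ in range(max_iters):
        if best == 0:
            break  # all collisions resolved
        subset = select_neighborhood(paths)
        replan = marl_replan if use_marl else pp_replan
        candidate = replan(paths, subset)
        cp = colliding_pairs(candidate)
        log.append(("marl" if use_marl else "pp", cp))
        improved = cp < best
        if cp <= best:
            # Accept candidates that do not increase the number of collisions.
            paths, best = candidate, cp
        stalled = 0 if improved else stalled + 1
        if use_marl and stalled >= patience:
            use_marl, stalled = False, 0  # switch to the fast PP phase
    return paths, best, log
```

In this sketch the stall counter is one simple way to make the switch adaptive; a time budget or a threshold on the remaining colliding pairs would fit the same loop equally well.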

Paper Structure (38 sections, 2 equations, 8 figures, 11 tables, 1 algorithm)

Introduction
Prior Work
Background
Multi-Agent Path Finding
LNS2
Path Finding with Mixed Dynamic Obstacles
Method: LNS2+RL
Overall Framework
Replanning Tasks as a MARL Problem
RL Environment Setup
Observation and Reward
1. Design from prior work
2. Reference path
3. Avoid collisions
4. Alleviate congestion
...and 23 more sections

Figures (8)

Figure 1: Network structure. AP, Cat, LN, and FC represent average pooling, concatenation, layer normalization, and fully connected layer, respectively. $h_{t-1}^i$ represents the hidden state output by the LSTM unit at the previous time step.
Figure 2: Results on different map types and visualization of representative maps. On the x-axis, the percentage represents agent density, while the number below it represents the number of agents. The success rate is the percentage of tasks that are fully solved before the termination condition is met. The remaining colliding pairs (CP) metric represents the number of CP remaining in the overall solution after reaching the time constraint and is only available for LNS2+RL and LNS2. The shaded area represents one standard deviation of the remaining CP. All three random maps have a 17.5% obstacle density. Except for the random-small map with a size of $10 \times 10$ and the random-large map with a size of $50 \times 50$, the other five maps have a size of $25 \times 25$. For maps with sizes $10 \times 10$, $25 \times 25$, and $50 \times 50$, the test counts are set to 100, 50, and 20, respectively, and the time constraints for each task, except when using SCRIMP, are set to 100 seconds, 600 seconds, and 5,000 seconds. To ensure consistency with the original paper, SCRIMP uses the maximum timestep as the termination condition, with maximum timesteps of 356, 556, and 1,024 for maps sized $10 \times 10$, $25 \times 25$, and $50 \times 50$, respectively. The map visualization shows the random-medium (top left), maze (top right), room (bottom left), and warehouse (bottom right) maps.
Figure 3: Number of remaining CP through the iterations. The dashed blue line shows the switch from MARL to PP+SIPPS.
Figure 4: Average performance along MAPF metrics of different algorithms for instances with different world sizes, team sizes, and map structures. For maps with sizes $10 \times 10$, $25 \times 25$, and $50 \times 50$, the test counts are set to 100, 50, and 20, respectively, and the time constraints for each task, except when using SCRIMP, are set to 100 seconds, 600 seconds, and 5,000 seconds. To ensure consistency with the original paper, SCRIMP uses the maximum timestep as the termination condition, with maximum timesteps of 356, 556, and 1,024 for maps sized $10 \times 10$, $25 \times 25$, and $50 \times 50$, respectively. The success rate is the percentage of tasks fully solved before the termination condition is met (i.e., where all agents reached their goal without collisions). The sum of cost only accounts for successful tasks. Partial runtime data for LaCAM and SCRIMP is unavailable because some tasks in the test exceeded the maximum memory limit we set (100GB), leading to premature termination. The remaining colliding pairs (CP) metric represents the number of CP remaining in the overall solution after reaching the time constraint and is only available for LNS2+RL and LNS2. Percentages in parentheses represent the reduction in remaining CP achieved by LNS2+RL relative to LNS2. ↑ indicates that "higher is better", and ↓ that "lower is better". "-" represents unavailable data. The best-performing algorithms in each task are highlighted in bold. Because SCRIMP uses the maximum timestep as the termination condition, it is excluded from the selection of the shortest time.
Figure 5: Results with different runtime limits and lower agent density.
...and 3 more figures
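
The metrics named in the captions above (success rate, sum of cost over successful tasks only, remaining CP, and the percentage reduction in remaining CP relative to LNS2) can be computed along the following lines. This is a hedged sketch: the record layout (`solved`, `soc`, `remaining_cp`) and the function names are hypothetical, and the reduction formula, (baseline - ours) / baseline, is an assumption about how the parenthesized percentages are defined rather than something the captions spell out.

```python
from statistics import mean


def summarize(runs):
    """Aggregate per-task results.

    runs: list of dicts with hypothetical keys
      'solved' (bool), 'soc' (sum of cost, or None if unsolved),
      'remaining_cp' (colliding pairs left at termination).
    """
    solved = [r for r in runs if r["solved"]]
    return {
        # Percentage of tasks fully solved before the termination condition.
        "success_rate": 100.0 * len(solved) / len(runs),
        # Sum of cost is averaged over successful tasks only.
        "mean_soc": mean(r["soc"] for r in solved) if solved else None,
        # Remaining CP is averaged over all tasks, solved or not.
        "mean_remaining_cp": mean(r["remaining_cp"] for r in runs),
    }


def cp_reduction(cp_baseline, cp_ours):
    """Assumed definition: relative reduction in remaining CP, in percent."""
    if cp_baseline == 0:
        return None  # undefined when the baseline has no remaining CP
    return 100.0 * (cp_baseline - cp_ours) / cp_baseline
```

Averaging the sum of cost over solved tasks only means that algorithms with different success rates are compared on different task subsets, which is worth keeping in mind when reading the cost columns.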
