Sequence Pathfinder for Multi-Agent Pickup and Delivery in the Warehouse

Zeyuan Zhao, Chaoran Li, Shao Zhang, Ying Wen

TL;DR

The paper tackles Multi-Agent Pickup and Delivery (MAPD) in warehouse-style environments by reframing Multi-Agent Path Finding (MAPF) as a sequence modeling problem and proving order-invariant optimality for autoregressive pathfinding policies. It introduces SePar, a Transformer-based Sequential Pathfinder that enables implicit inter-agent information exchange and reduces decision complexity from exponential to linear. SePar combines PPO-based reinforcement learning with imitation learning and uses an Observation Feature Extractor and a Multi-Agent Transformer to generate joint actions. Results on a warehouse simulator and the POGEMA MAPF benchmarks show that SePar outperforms most learning-based baselines and generalizes to unseen maps; they also indicate that imitation learning is essential on highly structured maps. The work targets scalable, globally informed MAPF/MAPD planning for warehouse robotics and multi-robot coordination.

Abstract

Multi-Agent Pickup and Delivery (MAPD) is a challenging extension of Multi-Agent Path Finding (MAPF), where agents are required to sequentially complete tasks with fixed-location pickup and delivery demands. Although learning-based methods have made progress in MAPD, they often perform poorly in warehouse-like environments with narrow pathways and long corridors when relying only on local observations for distributed decision-making. Communication learning can alleviate the lack of global information but introduces high computational complexity due to point-to-point communication. To address this challenge, we formulate MAPF as a sequence modeling problem and prove that path-finding policies under sequence modeling possess order-invariant optimality, ensuring their effectiveness in MAPD. Building on this, we propose the Sequential Pathfinder (SePar), which leverages the Transformer paradigm to achieve implicit information exchange, reducing decision-making complexity from exponential to linear while maintaining efficiency and global awareness. Experiments demonstrate that SePar consistently outperforms existing learning-based methods across various MAPF tasks and their variants, and generalizes well to unseen environments. Furthermore, we highlight the necessity of integrating imitation learning in complex maps like warehouses.
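
To make the exponential-to-linear claim concrete, the sketch below compares how many candidates a centralized policy would have to score when it picks a joint action directly with how many are scored when agents decide one after another. The five-action set (stay plus four grid moves) and the function names are illustrative assumptions, not details taken from the paper.

```python
def joint_action_candidates(num_agents: int, num_actions: int) -> int:
    """Joint actions a flat centralized policy must score: |A| ** n."""
    return num_actions ** num_agents


def autoregressive_candidates(num_agents: int, num_actions: int) -> int:
    """Per-agent actions scored when agents decide in sequence: n * |A|."""
    return num_agents * num_actions


if __name__ == "__main__":
    NUM_ACTIONS = 5  # assumption: stay, up, down, left, right
    for n in (2, 8, 32):
        flat = joint_action_candidates(n, NUM_ACTIONS)
        sequential = autoregressive_candidates(n, NUM_ACTIONS)
        print(f"agents={n:>2}  joint={flat}  sequential={sequential}")
```

The joint count grows as a power of the number of agents, while the sequential count grows proportionally to it, which is the sense in which a sequence-modeling formulation keeps decision-making tractable as the fleet grows.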

Paper Structure

This paper contains 31 sections, 1 theorem, 19 equations, 11 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

Order-invariant Optimality of Autoregressive Pathfinding Policies. Let $n\ge 1$ agents act simultaneously in the environment. For any permutation $\sigma\in S_n$, an autoregressive policy can be defined as: where $S_n$ is the set of all possible permutations of $n$ elements, $\sigma[k]$ represents the $k$-th element of permutation $\sigma$ and $\theta$ is the policy parameter. Then Let $f(\cdot)$
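
The autoregressive construction in the proposition, where agents choose in the order given by $\sigma$ and each conditions on the choices already made, can be sketched roughly as follows. The formula itself is not reproduced above, so the conditional-policy signature, `sample_joint_action`, and the toy policy are hypothetical names introduced only for illustration; this is not the paper's implementation.

```python
import random
from typing import Callable, Dict, List, Sequence

NUM_ACTIONS = 3  # assumption: a tiny action set for the toy example

# Hypothetical conditional policy: (agent id, observation, actions already
# chosen by earlier agents) -> probability distribution over actions.
ConditionalPolicy = Callable[[int, object, Dict[int, int]], List[float]]


def sample_joint_action(
    policy: ConditionalPolicy,
    observations: Sequence[object],
    order: Sequence[int],
    rng: random.Random,
) -> List[int]:
    """Sample one action per agent, visiting agents in the given order."""
    chosen: Dict[int, int] = {}
    for agent in order:  # order[k] plays the role of sigma[k]
        probs = policy(agent, observations[agent], chosen)
        chosen[agent] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Return actions indexed by agent id, independent of the decision order.
    return [chosen[i] for i in range(len(observations))]


def toy_policy(agent: int, obs: object, chosen: Dict[int, int]) -> List[float]:
    """Toy stand-in: uniform over actions no earlier agent has picked."""
    taken = set(chosen.values())
    weights = [0.0 if a in taken else 1.0 for a in range(NUM_ACTIONS)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    rng = random.Random(0)
    for order in ([0, 1, 2], [2, 0, 1]):
        print(order, sample_joint_action(toy_policy, [None] * 3, order, rng))
```

In this toy setting every decision order yields a conflict-free joint action, because later agents see what earlier agents picked. The sketch only illustrates the sampling mechanism; it makes no claim about the optimality result stated in the proposition.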

Figures (11)

  • Figure 1: Comparison of two grid environments. Left: an open-space map with wide free space and multiple routing options. Right: a warehouse-like map with long corridors and narrow pathways. These contrasting structures illustrate the intuitive difference in pathfinding difficulty.
  • Figure 2: Observation space of the agents (here for agent 1, in red). Agents and corridor endpoints are represented by different colors. Circles, squares, and hexagons denote agents’ goals, positions, and statuses, while diamonds mark corridor endpoints.
  • Figure 3: Path Finding Complexity Index of maps. Orange entries are maps from the POGEMA MAPF benchmark; blue entries are maps from our warehouse simulator.
  • Figure 4: Network Structure of the Transformer Pathfinder. At each step, the Observation Feature Extractor generates embeddings from agents' observations. The encoder refines these into new observations for the decoder, which uses masked attention to block access to subsequent agents' actions, ensuring each agent acts sequentially based on preceding agents.
  • Figure 5: Results in the warehouse simulation environment. Panels wh_small and wh_large present results on warehouse_small and warehouse_large. SePar outperforms the other learnable methods and shows scalability as agent numbers grow. Panel runtime shows the average time per timestep for LaCAM2 and SePar to generate actions.
  • ...and 6 more figures
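
The masked attention described in the Figure 4 caption, which blocks access to subsequent agents' actions, is commonly realized with a lower-triangular (causal) mask. The following is a minimal pure-Python sketch under that assumption; `causal_mask` and `masked_softmax` are hypothetical helper names, and whether the diagonal is included depends on how decoder inputs are shifted, which the caption does not specify.

```python
import math
from typing import List


def causal_mask(num_agents: int) -> List[List[bool]]:
    """mask[i][j] is True when decoder position i may attend to position j."""
    return [[j <= i for j in range(num_agents)] for i in range(num_agents)]


def masked_softmax(
    scores: List[List[float]], mask: List[List[bool]]
) -> List[List[float]]:
    """Row-wise softmax that gives zero weight to masked-out positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Subtract the largest allowed score for numerical stability.
        peak = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    n = 3
    uniform_scores = [[0.0] * n for _ in range(n)]
    for row in masked_softmax(uniform_scores, causal_mask(n)):
        print([round(w, 3) for w in row])
```

With uniform scores, the first position attends only to itself and each later position spreads its weight evenly over itself and the positions before it, so no information flows backward from agents that have not yet acted.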

Theorems & Definitions (1)

  • Proposition 1