
Multi-Agent Pointer Transformer: Seq-to-Seq Reinforcement Learning for Multi-Vehicle Dynamic Pickup-Delivery Problems

Zengyu Zou, Jingyuan Wang, Yixuan Huang, Junjie Wu

TL;DR

This work tackles the dynamic MVDPDPSR by introducing MAPT, a centralized decision framework that uses a Transformer-based encoder to represent all entities and a pointer-based autoregressive decoder to generate joint actions. A Relation-Aware (RA) Attention module captures inter-entity relationships, informative priors bias exploration toward high-reward actions, and the model is trained with PPO. Key contributions include a formal MDP formulation, the RA attention mechanism, autoregressive joint-action decoding, and informative priors, supported by empirical validation on eight datasets that shows improved solution quality and faster runtimes than classical methods. The framework advances real-time, scalable routing for on-demand logistics by handling stochastic requests and multi-vehicle coordination.
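The TL;DR states that informative priors bias exploration toward high-reward actions but does not say how the prior enters the policy. Below is a minimal sketch that assumes one common construction: adding a weighted log-prior to the policy logits before sampling. The names `prior_scores` and `prior_weight` are hypothetical. The scores stand in for any nonnegative heuristic (for example, inverse distance to a pickup station), and `mask` removes infeasible actions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax; entries equal to -inf receive probability 0."""
    m = max(logits)  # assumes at least one finite logit
    exps = [0.0 if x == float("-inf") else math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prior_biased_probs(policy_logits, prior_scores, prior_weight=1.0, mask=None):
    """Shift policy logits by prior_weight * log(prior) and renormalize.

    Assumption for illustration: prior_scores are nonnegative heuristic scores,
    and prior_weight is a hypothetical fixed knob for how strongly they bias exploration.
    mask: optional list of booleans; False marks an infeasible action.
    At least one action must remain feasible with a positive prior score.
    """
    biased = []
    for i, (logit, score) in enumerate(zip(policy_logits, prior_scores)):
        if (mask is not None and not mask[i]) or score <= 0:
            biased.append(float("-inf"))  # infeasible or zero-prior action
        else:
            biased.append(logit + prior_weight * math.log(score))
    return softmax(biased)


def sample_action(probs, rng=random):
    """Draw one action index from the biased distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Usage: a uniform policy over three actions, biased toward action 1 by the prior.
probs = prior_biased_probs([0.0, 0.0, 0.0], [1.0, 2.0, 1.0], prior_weight=1.0)
print([round(p, 3) for p in probs])  # [0.25, 0.5, 0.25]
print(sample_action(probs))
```

With `prior_weight=0` the sketch reduces to sampling from the unbiased policy; larger values push exploration toward actions the heuristic favors. This sketch keeps the weight fixed.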

Abstract

This paper addresses the cooperative Multi-Vehicle Dynamic Pickup and Delivery Problem with Stochastic Requests (MVDPDPSR) and proposes an end-to-end centralized decision-making framework based on sequence-to-sequence modeling, named Multi-Agent Pointer Transformer (MAPT). MVDPDPSR is an extension of the vehicle routing problem and a spatio-temporal system optimization problem, widely applied in scenarios such as on-demand delivery. Classical operations research methods face bottlenecks in computational complexity and time efficiency when handling large-scale dynamic problems. Although existing reinforcement learning methods have made some progress, they still face several challenges: 1) independent decoding across multiple vehicles fails to model the joint action distribution; 2) the feature extraction network struggles to capture inter-entity relationships; 3) the joint action space is exponentially large. To address these issues, we design the MAPT framework, which employs a Transformer Encoder to extract entity representations, combines a Transformer Decoder with a Pointer Network to generate joint action sequences in an AutoRegressive manner, and introduces a Relation-Aware Attention module to capture inter-entity relationships. Additionally, we guide the model's decision-making with informative priors to facilitate effective exploration. Experiments on 8 datasets demonstrate that MAPT significantly outperforms existing baseline methods in solution quality and offers substantial computational time advantages over classical operations research methods.
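The abstract says the Relation-Aware Attention module captures inter-entity relationships but does not give its form. The sketch below assumes a widely used pattern: single-head scaled dot-product attention with an additive pairwise bias `relation_bias[i][j]` on the attention logits. A known relation between entities i and j, such as a vehicle-to-station distance or an entity-type pair, then shifts how much i attends to j. The function names and the source of the bias are illustrative only and do not come from the paper.

```python
import math


def matmul_vec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relation_aware_attention(X, Wq, Wk, Wv, relation_bias):
    """Single-head attention with an additive pairwise relation bias (assumed design).

    X: list of n entity feature vectors.
    Wq, Wk, Wv: projection matrices given as lists of rows.
    relation_bias: n x n matrix added to the logit of entity i attending to j;
        each row needs at least one finite entry.
    Returns one output vector per entity.
    """
    Q = [matmul_vec(Wq, x) for x in X]
    K = [matmul_vec(Wk, x) for x in X]
    V = [matmul_vec(Wv, x) for x in X]
    d_k = len(Q[0])
    d_v = len(V[0])
    n = len(X)
    out = []
    for i in range(n):
        # Content score plus relation bias for every key j.
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k) + relation_bias[i][j]
            for j in range(n)
        ]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
        z = sum(weights)
        weights = [w / z for w in weights]
        # Weighted sum of value vectors.
        out.append([sum(weights[j] * V[j][c] for j in range(n)) for c in range(d_v)])
    return out


# Usage: two entities; a -inf bias forbids attending to the other entity.
I = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0]]
inf = float("inf")
print(relation_aware_attention(X, I, I, I, [[0.0, -inf], [-inf, 0.0]]))
```

Setting an entry of `relation_bias` to `-inf` blocks that pair entirely, which is also how attention masks are typically applied. Finite values only tilt the attention weights.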

Paper Structure

This paper contains 42 sections, 25 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: The framework of MAPT. The blue arrow indicates that the elements in the sequence are generated in an AutoRegressive manner. The elements marked with * are the actions that need to be decoded.
  • Figure 2: Sensitivity analysis of $\beta$ across datasets.
  • Figure 3: Schematic illustration of the MVDPDPSR problem.
  • Figure 4: Comparison between Rolling Horizon and Markov Decision Process paradigms. The Rolling Horizon paradigm requires accumulating a sufficient number of requests before invoking a static solver for vehicle routing optimization, while the Markov Decision Process paradigm enables real-time decision-making as requests arrive.
  • Figure 5: Comparison between Non-AutoRegressive Decoding (top) and AutoRegressive Decoding (bottom).
  • ...and 1 more figure
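Figure 5 contrasts non-autoregressive and autoregressive decoding. The abstract also notes that independent per-vehicle decoding cannot model the joint action distribution and that the joint action space is exponentially large. The toy sketch below illustrates both points under simplifying assumptions. `scores` is a hypothetical fixed vehicle-by-option score table, and conditioning on earlier vehicles is reduced to masking requests already claimed. A learned decoder would also update its scores after each step. The optional `idle_index` marks an option, such as waiting, that several vehicles may share.

```python
def non_autoregressive_decode(scores):
    """Each vehicle picks its best option independently (no coordination)."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in scores]


def autoregressive_decode(scores, idle_index=None):
    """Decode vehicles one at a time, masking options claimed by earlier vehicles.

    scores[v][j]: hypothetical score of vehicle v choosing option j.
    idle_index: optional shareable option; without it there must be at least
        as many options as vehicles.
    """
    taken = set()
    actions = []
    for row in scores:
        feasible = [j for j in range(len(row)) if j not in taken or j == idle_index]
        choice = max(feasible, key=lambda j: row[j])  # pointer over feasible options
        actions.append(choice)
        if choice != idle_index:
            taken.add(choice)
    return actions


def joint_action_space_size(num_options, num_vehicles):
    """Size of the joint action space when every vehicle picks one of num_options."""
    return num_options ** num_vehicles


scores = [[0.9, 0.1, 0.0],
          [0.8, 0.7, 0.0]]
print(non_autoregressive_decode(scores))  # [0, 0]: both vehicles claim option 0
print(autoregressive_decode(scores))      # [0, 1]: vehicle 1 sees option 0 is taken
print(joint_action_space_size(3, 2))      # 9
```

With M options and N vehicles, the joint space has M^N entries. Sequential decoding instead makes N pointer choices over at most M options each, and it still rules out the conflicting assignment [0, 0].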

Theorems & Definitions (7)

  • Definition 1: Stations
  • Definition 2: Vehicles
  • Definition 3: Requests
  • Definition 4: Observation/State
  • Definition 5: Action
  • Definition 6: Transition
  • Definition 7: Objective/Reward
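Definitions 1 to 7 are listed above by name only. The sketch below shows one plausible way to encode them as Python dataclasses plus a transition function. Every field (coordinates, capacity, pickup and delivery stations) and the reward (negative total travel distance) are assumptions for illustration, not the paper's formal definitions. The `new_requests` argument reflects the Markov Decision Process paradigm of Figure 4, where stochastic requests are revealed step by step rather than accumulated for a static solver.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Station:
    """Definition 1 (assumed fields): a location vehicles can visit."""
    station_id: int
    xy: Tuple[float, float]


@dataclass
class Request:
    """Definition 3 (assumed fields): a pickup-delivery request between two stations."""
    request_id: int
    pickup: int            # station_id of the pickup location
    delivery: int          # station_id of the delivery location
    picked_up: bool = False
    delivered: bool = False


@dataclass
class Vehicle:
    """Definition 2 (assumed fields): a vehicle with a position and a capacity."""
    vehicle_id: int
    station: int           # current station_id
    capacity: int
    onboard: List[int] = field(default_factory=list)  # request_ids being carried


@dataclass
class State:
    """Definition 4 (assumed): what the centralized policy observes at step t."""
    t: int
    stations: List[Station]
    vehicles: List[Vehicle]
    requests: List[Request]  # requests revealed so far


# Definition 5 (assumed): a joint action gives each vehicle its next station.
JointAction = List[int]


def distance(a: Station, b: Station) -> float:
    return ((a.xy[0] - b.xy[0]) ** 2 + (a.xy[1] - b.xy[1]) ** 2) ** 0.5


def step(state: State, action: JointAction,
         new_requests: List[Request]) -> Tuple[State, float]:
    """Definitions 6 and 7 (assumed): move vehicles, serve requests, reveal arrivals.

    Mutates state in place and returns (state, reward), where the reward is the
    negative total travel distance of this step.
    """
    by_id = {s.station_id: s for s in state.stations}
    travel = 0.0
    for vehicle, target in zip(state.vehicles, action):
        travel += distance(by_id[vehicle.station], by_id[target])
        vehicle.station = target
        # Drop off carried requests whose delivery station is the target.
        for req in state.requests:
            if req.request_id in vehicle.onboard and req.delivery == target:
                req.delivered = True
                vehicle.onboard.remove(req.request_id)
        # Pick up waiting requests at the target while capacity allows.
        for req in state.requests:
            if (not req.picked_up and req.pickup == target
                    and len(vehicle.onboard) < vehicle.capacity):
                req.picked_up = True
                vehicle.onboard.append(req.request_id)
    state.requests.extend(new_requests)  # stochastic arrivals revealed after the move
    state.t += 1
    return state, -travel


stations = [Station(0, (0.0, 0.0)), Station(1, (3.0, 4.0)), Station(2, (6.0, 8.0))]
state = State(t=0, stations=stations,
              vehicles=[Vehicle(0, station=0, capacity=1)],
              requests=[Request(0, pickup=1, delivery=2)])
state, reward = step(state, [1], new_requests=[])
print(reward, state.vehicles[0].onboard)  # -5.0 [0]
```

For brevity, `step` mutates the state in place. A training environment would usually copy it so that earlier states remain available.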