SED2AM: Solving Multi-Trip Time-Dependent Vehicle Routing Problem using Deep Reinforcement Learning

Arash Mozhdehi, Yunli Wang, Sun Sun, Xin Wang

TL;DR

This work tackles the multi-trip time-dependent vehicle routing problem (MTTDVRP) with maximum working hours constraints by introducing SED2AM, a Transformer-based DRL policy with a simultaneous encoder and dual decoders (one for vehicle selection, one for trip construction). By incorporating time-interval-specific embeddings and a fleet-aware state representation, the method captures time-varying travel times and assigns trips to vehicles while respecting capacity and working-time limits. Training uses REINFORCE with a greedy rollout baseline. Experiments on Edmonton and Calgary data show that SED2AM outperforms state-of-the-art DRL-based and metaheuristic-based baselines and generalizes to larger problem instances. The results suggest practical value for urban logistics, where routing must adapt to time-varying traffic and regulatory limits on working hours. Future work targets mixed pickups and deliveries and time-window constraints.
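As a rough illustration of the training signal mentioned above, the following is a minimal sketch of the REINFORCE surrogate loss with a greedy rollout baseline. The function name `reinforce_rollout_loss` and its list-based inputs are hypothetical and not taken from the paper; in practice the log-probabilities would be differentiable tensors in a deep learning framework, and the advantage would be treated as a constant during backpropagation.

```python
from typing import Sequence


def reinforce_rollout_loss(
    sampled_costs: Sequence[float],
    baseline_costs: Sequence[float],
    log_probs: Sequence[float],
) -> float:
    """Batch-mean REINFORCE surrogate loss with a rollout baseline.

    sampled_costs:  cost of the solution sampled from the current policy
    baseline_costs: cost of the greedy solution from the baseline policy
    log_probs:      summed log-probability of each sampled solution

    All three names are illustrative assumptions, not the paper's notation.
    """
    if not (len(sampled_costs) == len(baseline_costs) == len(log_probs)):
        raise ValueError("all inputs must have the same length")
    if len(sampled_costs) == 0:
        raise ValueError("batch must not be empty")

    # Advantage: how much worse (positive) or better (negative) the sampled
    # solution is than the greedy rollout on the same instance.
    advantages = [c - b for c, b in zip(sampled_costs, baseline_costs)]

    # Minimizing this value lowers the log-probability of solutions that are
    # worse than the baseline and raises it for solutions that are better.
    return sum(a * lp for a, lp in zip(advantages, log_probs)) / len(advantages)


loss = reinforce_rollout_loss([10.0, 12.0], [11.0, 11.0], [-1.0, -2.0])
```

Subtracting the greedy rollout cost does not change the expected gradient but typically reduces its variance, which is the usual motivation for this kind of baseline.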

Abstract

Deep reinforcement learning (DRL)-based frameworks, featuring Transformer-style policy networks, have demonstrated their efficacy across various vehicle routing problem (VRP) variants. However, the application of these methods to the multi-trip time-dependent vehicle routing problem (MTTDVRP) with maximum working hours constraints -- a pivotal element of urban logistics -- remains largely unexplored. This paper introduces a DRL-based method called the Simultaneous Encoder and Dual Decoder Attention Model (SED2AM), tailored for the MTTDVRP with maximum working hours constraints. The proposed method introduces a temporal locality inductive bias to the encoding module of the policy networks, enabling it to effectively account for the time-dependency in travel distance or time. The decoding module of SED2AM includes a vehicle selection decoder that selects a vehicle from the fleet, effectively associating trips with vehicles for functional multi-trip routing. Additionally, this decoding module is equipped with a trip construction decoder leveraged for constructing trips for the vehicles. This policy model is equipped with two classes of state representations, fleet state and routing state, providing the information needed for effective route construction in the presence of maximum working hours constraints. Experimental results using real-world datasets from two major Canadian cities not only show that SED2AM outperforms the current state-of-the-art DRL-based and metaheuristic-based baselines but also demonstrate its generalizability to solve larger-scale problems.
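To make the problem constraints concrete, here is a minimal sketch of how a single vehicle's multi-trip schedule could be checked against a capacity limit and a maximum working hours limit when travel times depend on the departure time. It assumes piecewise-constant travel times per time interval, no service time at customers, and a clock measured in minutes from the start of the shift; these simplifications, along with the names `travel_time` and `vehicle_schedule_is_feasible`, are illustrative assumptions and not the paper's formulation.

```python
from typing import Sequence


def travel_time(matrices, interval_minutes, depart_minute, i, j):
    """Travel time from node i to node j when departing at depart_minute.

    matrices[k][i][j] holds the travel time during the k-th interval; the
    last interval is reused for departures past the end of the horizon.
    """
    k = min(int(depart_minute // interval_minutes), len(matrices) - 1)
    return matrices[k][i][j]


def vehicle_schedule_is_feasible(
    trips: Sequence[Sequence[int]],
    demands: Sequence[float],
    matrices,
    interval_minutes: float,
    capacity: float,
    max_work_minutes: float,
    depot: int = 0,
) -> bool:
    """Check capacity per trip and total working time across all trips."""
    clock = 0.0  # minutes since the vehicle started its shift
    for trip in trips:
        # Capacity applies per trip: the vehicle reloads at the depot.
        if sum(demands[c] for c in trip) > capacity:
            return False
        prev = depot
        # Each trip starts and ends at the depot.
        for node in list(trip) + [depot]:
            clock += travel_time(matrices, interval_minutes, clock, prev, node)
            prev = node
    # Working time accumulates over every trip of the same vehicle.
    return clock <= max_work_minutes


# Toy instance: depot 0 and customers 1, 2; two 60-minute intervals in which
# every leg takes 30 and then 60 minutes respectively.
off_peak = [[0, 30, 30], [30, 0, 30], [30, 30, 0]]
peak = [[0, 60, 60], [60, 0, 60], [60, 60, 0]]
feasible = vehicle_schedule_is_feasible(
    [[1], [2]], [0, 10, 20], [off_peak, peak], 60, 25, 180
)
```

The second trip in the toy instance departs during the slower interval, so the same legs cost twice as much time as in the first trip; this is the effect that makes trip-to-vehicle assignment matter under a working-hours limit.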

Paper Structure

This paper contains 25 sections, 21 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Example of MTTDVRP with maximum working hours constraints. This illustration depicts multiple trips conducted by two vehicles within a fleet, adhering to a maximum working hours limit of 13 hours (7 AM to 8 PM) and a vehicle capacity of 25 units per trip. The red numbers above each customer location represent the customer number, while the green labels indicate travel times between each origin and destination based on traffic conditions. The table summarizes the time each customer is served, the customer's demand, and which vehicle on which trip served the customer.
  • Figure 3: Overall architecture of SED2AM.
  • Figure 4: Illustration of the simultaneous encoder architecture: the encoder computes time-dependent node embeddings by simultaneously co-embedding the node features and edge features through $L$ layers of attention.
  • Figure 5: At each decoding step $t$, the vehicle selection decoder generates a probability distribution over the vehicles, selecting one to expand its route.
  • Figure 6: The architecture of the trip construction decoder. At each decoding step, this decoder outputs a probability distribution over unvisited nodes along with the depot based on the current state of the selected vehicle, graph embedding, and remaining nodes' time-dependent embeddings. Based on these probabilities, the node to be visited by the corresponding vehicle is then selected.
  • ...and 3 more figures
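
The decoder captions above describe a probability distribution over unvisited nodes plus the depot. A minimal sketch of that masking step, assuming the decoder has already produced one compatibility score per node, might look as follows. The function `masked_node_probabilities` is hypothetical, and it masks only visited nodes; a full model would also need to mask nodes that violate capacity or working-hours limits.

```python
import math
from typing import List, Sequence


def masked_node_probabilities(
    scores: Sequence[float],
    visited: Sequence[bool],
    depot: int = 0,
) -> List[float]:
    """Softmax over node scores, restricted to unvisited nodes and the depot."""
    # The depot stays selectable so that the vehicle can end its trip.
    allowed = [i for i in range(len(scores)) if i == depot or not visited[i]]

    # Subtract the maximum allowed score for numerical stability.
    top = max(scores[i] for i in allowed)
    weights = {i: math.exp(scores[i] - top) for i in allowed}
    total = sum(weights.values())

    # Masked nodes receive probability exactly zero.
    return [weights.get(i, 0.0) / total for i in range(len(scores))]


probs = masked_node_probabilities(
    scores=[0.0, 1.0, 2.0, 1.0],
    visited=[False, True, False, False],
)
next_node = max(range(len(probs)), key=probs.__getitem__)  # greedy choice
```

Sampling from `probs` instead of taking the maximum would give the stochastic rollouts used during policy-gradient training, while the greedy choice corresponds to deterministic decoding.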

Theorems & Definitions (4)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4