Reinforcement Learning for Solving Stochastic Vehicle Routing Problem with Time Windows

Zangir Iklassov, Ikboljon Sobirov, Ruben Solozabal, Martin Takac

TL;DR

This paper addresses the Stochastic Vehicle Routing Problem (SVRP) with Time Windows by introducing a reinforcement learning framework that accounts for stochastic demands and uncertain travel costs, while leveraging external information and time-window constraints. It develops an attention-based policy trained with REINFORCE (policy gradient) to minimize expected routing costs, including a recourse cost for failure scenarios, and compares it against Clarke-Wright, Tabu Search, and Ant Colony Optimization baselines. The RL approach achieves a 1.73% travel-cost reduction over Ant Colony Optimization, the strongest of these classical baselines, and remains robust across diverse environmental configurations, inference strategies, and problem sizes. The paper also examines the integration of external variables, inference techniques (greedy, sampling, beam search), and the impact of stochastic components on performance, offering a versatile benchmark for SVRP research and industry applications.
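
For readers unfamiliar with REINFORCE, the generic form of the policy-gradient estimator is given here as a reminder. The notation is assumed for illustration and is not taken from the paper: $s$ is a problem instance, $\pi$ a set of routes sampled from the policy $p_\theta$, $C(\pi)$ the realized travel cost including any recourse cost, and $b(s)$ an optional baseline used to reduce variance.

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)}\left[\left(C(\pi) - b(s)\right)\nabla_\theta \log p_\theta(\pi \mid s)\right]
$$

Because $C(\pi)$ is observed only after the stochastic demands and travel costs are realized, averaging this estimator over sampled instances targets the expected cost rather than the cost of any single scenario.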

Abstract

This paper introduces a reinforcement learning approach to optimize the Stochastic Vehicle Routing Problem with Time Windows (SVRP), focusing on reducing travel costs in goods delivery. We develop a novel SVRP formulation that accounts for uncertain travel costs and demands, alongside specific customer time windows. An attention-based neural network trained through reinforcement learning is employed to minimize routing costs. Our approach addresses a gap in SVRP research, which traditionally relies on heuristic methods, by leveraging machine learning. The model outperforms the Ant-Colony Optimization algorithm, achieving a 1.73% reduction in travel costs. It uniquely integrates external information, demonstrating robustness in diverse environments, making it a valuable benchmark for future SVRP studies and industry application.

Paper Structure (15 sections, 11 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: This Vehicle Routing Problem instance involves a graph with customer nodes and a central depot, all positioned within a two-dimensional coordinate system, where each customer node has a demand for goods and links between nodes have associated travel costs. Vehicles start from the depot and visit these customer nodes sequentially. Our formulation, tailored for industrial use, includes stochastic demands and undisclosed travel costs, as well as time windows for goods delivery. The decision-making agent operates in this stochastic environment, aiming to learn and adapt strategies to reduce expected travel costs.
  • Figure 2: Model Architecture. The lower segment depicts the input structure of the model. The model produces embeddings for two input types: Customer and Vehicle. These embeddings are then merged through an Attention Layer, yielding a probability for each node of being the next position of each vehicle. Node $0$ is the depot, which has zero demand and is always available. Finally, the probabilities are masked to exclude customers on the route whose demand has already been satisfied.
  • Figure 3: Percentage difference in travel costs between scenarios with correlated variables (setting $\mathrm{A},\mathrm{B},\Gamma=0.8,0.2,0.0$) and uncorrelated variables (setting $\mathrm{A},\mathrm{B},\Gamma=0.8,0.0,0.2$).
  • Figure 4: The incurred travel cost throughout the training phase, employing two distinct customer position approaches: fixed and flexible.
  • Figure 5: The travel cost incurred during the training phase, employing two different delivery approaches: full and partial.
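
The masking step described in the Figure 2 caption, together with the greedy and sampling inference strategies mentioned in the TL;DR, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the NumPy-based softmax, and the choice to mask scores before normalization (rather than zeroing probabilities afterwards) are assumptions made for illustration.

```python
import numpy as np

def masked_probabilities(scores, served):
    """Softmax over per-node scores, with already-served customers excluded.

    scores: hypothetical attention scores, one per node (index 0 is the depot).
    served: booleans marking customers whose demand is already satisfied.
    """
    scores = np.asarray(scores, dtype=float)
    mask = np.asarray(served, dtype=bool).copy()
    mask[0] = False  # assumption: the depot is always available, so never masked
    masked = np.where(mask, -np.inf, scores)  # masked nodes get zero probability
    weights = np.exp(masked - masked.max())   # subtract max for numerical stability
    return weights / weights.sum()

def pick_next_node(probs, strategy="greedy", rng=None):
    """Choose the next node for a vehicle from its probability vector."""
    if strategy == "greedy":
        return int(np.argmax(probs))          # always take the most likely node
    if strategy == "sampling":
        rng = rng or np.random.default_rng()
        return int(rng.choice(len(probs), p=probs))  # draw a node at random
    raise ValueError(f"unknown strategy: {strategy}")

# Illustrative values only: four nodes, customer 3 already served.
scores = [0.5, 2.0, 1.0, 3.0]
served = [False, False, False, True]
probs = masked_probabilities(scores, served)
next_node = pick_next_node(probs, strategy="greedy")
```

Customer 3 has the highest raw score here, but because it is masked, the greedy choice falls to the best remaining node. Sampling draws from the same masked distribution, so it can never select a served customer either; beam search would keep several partial routes alive instead of committing to one.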