End-to-end Deep Reinforcement Learning for Stochastic Multi-objective Optimization in C-VRPTW
Abdo Abouelrous, Laurens Bliek, Yaoxin Wu, Yingqian Zhang
TL;DR
The paper introduces an end-to-end deep reinforcement learning approach for the stochastic, multi-objective capacitated vehicle routing problem with time windows (C-VRPTW). It combines a POMO-based multi-objective (MO) component with Efficient Active Search and a scenario clustering strategy to handle travel-time uncertainty while constructing a Pareto front across objectives. Empirical results show competitive Pareto-front quality and reduced training runtime compared to baselines, and ablation studies examine the effects of Monte Carlo evaluation and travel-time variability. The work demonstrates that end-to-end MO optimization under stochasticity is feasible and offers a framework that can be extended to other routing problems.
Abstract
In this work, we apply learning-based methods to a Vehicle Routing variant characterized by stochasticity and multiple objectives. Such problems are representative of practical settings where decision-makers must deal with uncertainty in the operational environment as well as multiple conflicting objectives arising from different stakeholders. We specifically consider travel time uncertainty and two objectives, total travel time and route makespan, which jointly target operational efficiency and labor regulations on shift length, although other objectives could be incorporated. Learning-based methods offer substantial computational advantages, as they can repeatedly solve problems with limited intervention from the decision-maker. We focus on end-to-end deep learning models that leverage the attention mechanism and multiple solution trajectories. These models have seen several successful applications in routing problems. However, because the large dimensions of the travel time matrix prevent travel times from being a direct input to these models, accounting for uncertainty is a challenge, especially in the presence of multiple objectives. We therefore propose a model that simultaneously addresses stochasticity and multi-objectivity, and we provide a refined training mechanism for this model that uses scenario clustering to reduce training time. Our results show that our model is capable of constructing a Pareto front of good quality within acceptable run times compared to three baselines.
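To make the two objectives and the Pareto front concrete, the following is a minimal illustrative sketch and not the authors' method. It assumes that node 0 is the depot, that each travel-time scenario is a full matrix, that makespan means the duration of the longest route, and that objectives are averaged over scenarios in a Monte Carlo fashion. Time windows and capacities are ignored, and the function names `route_times`, `expected_objectives`, and `pareto_front` are hypothetical.

```python
def route_times(routes, travel_time):
    """Duration of each route under one travel-time scenario (depot is node 0)."""
    durations = []
    for route in routes:
        path = [0] + route + [0]  # leave from and return to the depot
        durations.append(sum(travel_time[a][b] for a, b in zip(path, path[1:])))
    return durations


def expected_objectives(routes, scenarios):
    """Average (total travel time, makespan) over the sampled scenarios."""
    total = 0.0
    makespan = 0.0
    for travel_time in scenarios:
        durations = route_times(routes, travel_time)
        total += sum(durations)     # objective 1: total travel time
        makespan += max(durations)  # objective 2: longest route (assumed definition)
    n = len(scenarios)
    return total / n, makespan / n


def pareto_front(points):
    """Keep the points that no other point dominates (both objectives minimized)."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


# Two hypothetical travel-time scenarios for a depot and two customers.
tt_low = [[0, 2, 3], [2, 0, 4], [3, 4, 0]]
tt_high = [[0, 4, 6], [4, 0, 8], [6, 8, 0]]
scenarios = [tt_low, tt_high]

two_vehicles = expected_objectives([[1], [2]], scenarios)  # one customer per route
one_vehicle = expected_objectives([[1, 2]], scenarios)     # a single longer route
front = pareto_front([two_vehicles, one_vehicle])
```

In this toy instance, using one vehicle lowers the expected total travel time while using two lowers the expected makespan, so neither solution dominates the other and both stay on the front.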
