Learn to Tour: Operator Design For Solution Feasibility Mapping in Pickup-and-delivery Traveling Salesman Problem
Bowen Fang, Xu Chen, Xuan Di
TL;DR
The paper tackles PDTSP, where each pickup node $i$ must precede its corresponding delivery node $n+i$ (the precedence constraint $p_i<d_{n+i}$) and where traditional solvers struggle to scale. It introduces L2T, a reinforcement-learning framework that uses a unified operator set (N1, N2, N3, B1, B2) designed to map feasible tours to other feasible tours, thereby confining the search to the feasible solution space. A key idea is to represent tours as sequences of pickup and delivery blocks and to construct initial feasible tours with a simple feasibility-based rule. The policy network, which uses a feature-rich architecture and is trained with PPO, learns to select operators that iteratively reduce tour cost. Empirical results on Grubhub PDTSP instances and a Capacitated-PDTSP scenario show that L2T achieves shorter tours and better scalability than strong baselines such as OR-Tools, Gurobi, Ptr-Net, Transformer, and LKH3, which makes it relevant to large-scale pickup-and-delivery routing.
Abstract
This paper develops a learning method for a special class of traveling salesman problems (TSP), namely the pickup-and-delivery TSP (PDTSP), which seeks the shortest tour through a sequence of one-to-one pickup-and-delivery nodes. One-to-one here means that the transported people or goods are associated with designated pairs of pickup and delivery nodes, in contrast to settings where indistinguishable goods can be delivered to any node. In PDTSP, a precedence constraint must be satisfied: each pickup node must be visited before its corresponding delivery node. Classic operations research (OR) algorithms for PDTSP are difficult to scale to large problems. Recently, reinforcement learning (RL) has been applied to TSPs. The basic idea is to explore and evaluate visiting sequences in a solution space. However, this approach can be computationally inefficient, because it may evaluate many infeasible solutions that violate the precedence constraints. To restrict the search to the feasible space, we use operators that always map one feasible solution to another, so no time is spent exploring infeasible solutions. These operators are evaluated and selected by a policy to solve PDTSPs in an RL framework. We compare our method with baselines, including classic OR algorithms and existing learning methods. Results show that our approach finds shorter tours than the baselines.
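To make the notion of a feasibility-preserving operator concrete, the following minimal Python sketch checks the precedence constraint and applies one simple move that cannot violate it. It is an illustration only, not one of the paper's operators (N1, N2, N3, B1, B2). The node labelling (pickups `1..n`, deliveries `n+1..2n`, depot omitted) and the function names `is_feasible` and `guarded_adjacent_swap` are assumptions introduced here.

```python
def is_feasible(tour, n):
    """Return True if every pickup i appears before its delivery n + i.

    Assumed labelling (hypothetical): pickups are 1..n, deliveries are n+1..2n,
    and the tour visits each of the 2n nodes exactly once.
    """
    if sorted(tour) != list(range(1, 2 * n + 1)):
        return False
    position = {node: index for index, node in enumerate(tour)}
    return all(position[i] < position[n + i] for i in range(1, n + 1))


def guarded_adjacent_swap(tour, k, n):
    """Swap the nodes at positions k and k + 1 unless that would break precedence.

    An adjacent swap changes the relative order of exactly one pair of nodes,
    so it can only create a violation when that pair is a pickup immediately
    followed by its own delivery. In that case the tour is returned unchanged.
    """
    first, second = tour[k], tour[k + 1]
    if first <= n and second == first + n:
        return list(tour)
    new_tour = list(tour)
    new_tour[k], new_tour[k + 1] = second, first
    return new_tour
```

Starting from any feasible tour, repeated calls to `guarded_adjacent_swap` stay inside the feasible space, which is the property the abstract asks of its operators. An RL policy in this setting would choose which operator to apply and where, rather than filtering out infeasible candidates after the fact.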
