Take a Step and Reconsider: Sequence Decoding for Self-Improved Neural Combinatorial Optimization

Jonathan Pirnay, Dominik G. Grimm

TL;DR

This paper presents a simple and problem-independent sequence decoding method for self-improved learning based on sampling sequences without replacement that outperforms previous NCO approaches on the Job Shop Scheduling Problem.

Abstract

The constructive approach within Neural Combinatorial Optimization (NCO) treats a combinatorial optimization problem as a finite Markov decision process, where solutions are built incrementally through a sequence of decisions guided by a neural policy network. To train the policy, recent research is shifting toward a 'self-improved' learning methodology that addresses the limitations of reinforcement learning and supervised approaches. Here, the policy is iteratively trained in a supervised manner, with solutions derived from the current policy serving as pseudo-labels. The way these solutions are obtained from the policy determines the quality of the pseudo-labels. In this paper, we present a simple and problem-independent sequence decoding method for self-improved learning based on sampling sequences without replacement. We incrementally follow the best solution found and repeat the sampling process from intermediate partial solutions. By modifying the policy to ignore previously sampled sequences, we force it to consider only unseen alternatives, thereby increasing solution diversity. Experimental results for the Traveling Salesman and Capacitated Vehicle Routing Problem demonstrate its strong performance. Furthermore, our method outperforms previous NCO approaches on the Job Shop Scheduling Problem.
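
The self-improved training loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' implementation: the policy is a table of per-position logits rather than a neural network, candidates are drawn by plain sampling rather than by the decoding method proposed in the paper, and the names `self_improve`, `sample_sequence`, and `cost` are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample_sequence(logits, rng):
    """Build one full solution, one decision per position (toy tabular policy)."""
    return tuple(rng.choices(range(len(row)), weights=softmax(row))[0] for row in logits)


def self_improve(cost, length, n_actions, rounds=50, samples=8, lr=0.5, seed=0):
    """Toy self-improvement: the best solution found so far serves as pseudo-label."""
    rng = random.Random(seed)
    logits = [[0.0] * n_actions for _ in range(length)]
    best, history = None, []
    for _ in range(rounds):
        # Decode the current policy (assumption: plain sampling with replacement).
        candidates = [sample_sequence(logits, rng) for _ in range(samples)]
        candidate = min(candidates, key=cost)
        if best is None or cost(candidate) < cost(best):
            best = candidate
        history.append(cost(best))
        # Supervised step: one gradient step on the cross-entropy to the pseudo-label.
        for t, a in enumerate(best):
            p = softmax(logits[t])
            for j in range(n_actions):
                logits[t][j] += lr * ((1.0 if j == a else 0.0) - p[j])
    return best, history


if __name__ == "__main__":
    # Hypothetical objective: lower is better.
    best, history = self_improve(lambda seq: sum(seq), length=5, n_actions=3)
    print(best, history[-1])
```

The quality of `best` depends entirely on how candidates are obtained from the policy, which is the step the paper's decoding method targets.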

Paper Structure (34 sections, 3 equations, 3 figures, 2 tables, 1 algorithm)

This paper contains 34 sections, 3 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of self-improved training with TSP as an illustrative example. A partial solution corresponds to a partial (unfinished) tour. The next sequence element for the model to predict is the next edge to be appended to the partial tour.
  • Figure 2: Example of sequence decoding with beam width $k=3$ and step size $s=2$. (a) We sample $k$ leaves without replacement (WOR; indicated in red) from the root node (dashed outline), creating nodes on demand. We follow the trajectory of the best solution for $s$ steps (indicated in blue). (b) We shift the root node by $s$ steps, discard the rest of the tree, and remove the probability mass of sampled leaves (grayed out) from their ancestors. We sample $k$ unseen alternatives from the new root and find a better solution. We follow the new solution for $s$ steps. (c) After shifting the root again, only one leaf is left to sample, which does not improve the current best solution.
  • Figure 3: Decoding the policy with our sequence decoding method ('Ours') compared to sampling sequences without replacement with SBS ('Sample WOR') and to the sampling method GD [pirnay2024self]. The numbers of sequences sampled WOR and with GD are given as multiples of the beam width $k$ so that the computational effort is aligned. Points with the same marker share the same compute budget. For the routing problems, we average optimality gaps across 100 instances. For the JSSP, the corresponding Taillard benchmark set is used. Sampling for each data point is repeated ten times; shaded areas denote standard errors.
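
The decoding procedure in the Figure 2 caption can be sketched on a toy search tree. This is a hedged illustration under stated assumptions, not the authors' code: `toy_policy` is a hypothetical stand-in for the neural policy, `toy_cost` is a made-up objective, and leaves are drawn one at a time without replacement by subtracting the probability mass of sampled leaves from their ancestors, whereas the figure captions indicate that the paper samples with SBS. Exact fractions keep the remaining mass free of rounding error.

```python
import random
from fractions import Fraction


def toy_policy(prefix, n_actions):
    """Hypothetical stand-in for a neural policy: exact next-action probabilities."""
    rng = random.Random(hash(prefix))
    weights = [rng.randint(1, 9) for _ in range(n_actions)]
    return [Fraction(w, sum(weights)) for w in weights]


def toy_cost(seq):
    """Made-up objective for illustration; lower is better."""
    return sum((i + 1) * ((7 * a + 3 * i) % 5) for i, a in enumerate(seq))


class Node:
    def __init__(self, prefix, prob):
        self.prefix = prefix   # partial solution (tuple of actions)
        self.prob = prob       # probability of reaching this node
        self.mass = prob       # mass of leaves below that are still unsampled
        self.children = None   # created on demand

    def expand(self, n_actions):
        if self.children is None:
            probs = toy_policy(self.prefix, n_actions)
            self.children = [Node(self.prefix + (a,), self.prob * p)
                             for a, p in enumerate(probs)]


def sample_unseen_leaf(root, depth, n_actions, rng):
    """Draw one leaf not sampled before, then remove its mass from its ancestors."""
    node, path = root, [root]
    while len(node.prefix) < depth:
        node.expand(n_actions)
        live = [c for c in node.children if c.mass > 0]
        node = rng.choices(live, weights=[float(c.mass) for c in live])[0]
        path.append(node)
    for n in path:
        n.mass -= node.prob
    return node.prefix


def decode(cost, depth, n_actions, k, s, seed=0):
    rng = random.Random(seed)
    root, best = Node((), Fraction(1)), None
    while len(root.prefix) < depth:
        # Sample up to k unseen leaves below the current root.
        for _ in range(k):
            if root.mass == 0:
                break
            leaf = sample_unseen_leaf(root, depth, n_actions, rng)
            if best is None or cost(leaf) < cost(best):
                best = leaf
        # Follow the best solution for s steps; the rest of the tree is dropped.
        for _ in range(s):
            if len(root.prefix) == depth:
                break
            root.expand(n_actions)
            root = root.children[best[len(root.prefix)]]
    return best


if __name__ == "__main__":
    print(decode(toy_cost, depth=4, n_actions=3, k=3, s=2))
```

Because the new root keeps its reduced mass after each shift, later rounds can only return sequences that have not been sampled before, which mirrors the "unseen alternatives" behaviour the caption describes.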