Leader Reward for POMO-Based Neural Combinatorial Optimization

Chaoyang Wang, Pengzhi Cheng, Jingze Li, Weiwei Sun

TL;DR

This work targets neural combinatorial optimization by reframing training objectives to maximize the best solution across multiple inferences, rather than the average performance. It introduces Leader Reward, a simple augmentation to the REINFORCE gradient that emphasizes the leader trajectory, and demonstrates its effectiveness in two training phases within POMO-based models. The approach yields substantial improvements across TSP, CVRP, and FFSP, including orders-of-magnitude reductions in gap to optimum for TSP100 when combined with inference strategies like SGBS+EAS, with minimal overhead. The results suggest Leader Reward as a broadly applicable, practical enhancement for neural CO solvers, compatible with various models and inference techniques, and offering meaningful gains in solution quality and generalization.

Abstract

Deep neural networks based on reinforcement learning (RL) for solving combinatorial optimization (CO) problems are developing rapidly and have shown a tendency to approach or even outperform traditional solvers. However, existing methods overlook an important distinction: unlike other traditional problems, CO problems care only about the best solution the model provides within a given time budget, not about the overall quality of all the solutions it generates. In this paper, we propose Leader Reward and apply it during two different training phases of the Policy Optimization with Multiple Optima (POMO) model to enhance the model's ability to generate optimal solutions. This approach is applicable to a variety of CO problems, such as the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP), and the Flexible Flow Shop Problem (FFSP), and it also works well with other POMO-based models and inference-phase strategies. We demonstrate that Leader Reward greatly improves the quality of the best solutions generated by the model. Specifically, we reduce POMO's gap to the optimum on TSP100 by a factor of more than 100 with almost no additional computational overhead.
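
The idea of emphasizing the leader trajectory can be illustrated with a small sketch. This summary does not give the exact formula, so the version below is only an assumption: it starts from the shared-baseline advantage used in POMO (each trajectory's reward minus the mean reward over all trajectories of the same instance) and multiplies the advantage of the best trajectory by a factor `alpha`. The function names and the multiplicative form are hypothetical, chosen for illustration.

```python
from statistics import mean


def pomo_advantages(rewards):
    """Shared-baseline advantages: each reward minus the mean over trajectories."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def leader_boosted_advantages(rewards, alpha=2.0):
    """Hypothetical leader weighting: scale the best trajectory's advantage by alpha.

    The multiplicative form is an assumption for illustration, not a formula
    taken from the paper.
    """
    advantages = pomo_advantages(rewards)
    # The leader is the trajectory with the highest reward.
    leader = max(range(len(rewards)), key=lambda i: rewards[i])
    advantages[leader] *= alpha
    return advantages


# Rewards are negative tour lengths, so the largest value is the shortest tour.
rewards = [-10.0, -8.0, -12.0, -10.0]
print(pomo_advantages(rewards))                       # [0.0, 2.0, -2.0, 0.0]
print(leader_boosted_advantages(rewards, alpha=3.0))  # [0.0, 6.0, -2.0, 0.0]
```

In a REINFORCE-style update these advantages would weight the log-probability gradients of the corresponding trajectories, so under this assumed form a larger `alpha` pushes the policy harder toward the best sampled solution than toward the average one.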

Paper Structure (23 sections, 6 equations, 3 figures, 9 tables, 2 algorithms)

Figures (3)

  • Figure 1: Log probability curve during the training phase.
  • Figure 2: Gap of different sampling times on 10,000 instances of TSP100.
  • Figure 3: Comparison of learning curves when choosing different $\alpha$ for Leader Reward during the main training phase on the TSP, CVRP, and FFSP problems.

Theorems & Definitions (1)

  • proof