TSS GAZ PTP: Towards Improving Gumbel AlphaZero with Two-stage Self-play for Multi-constrained Electric Vehicle Routing Problems

Hui Wang, Xufeng Zhang, Xiaoyu Zhang, Zhenhuan Ding, Chaoxu Mu

TL;DR

The paper introduces TSS GAZ PTP, a two-stage self-play training framework that strengthens Gumbel AlphaZero for combinatorial optimization (CO) problems by making a learning agent compete against opponents of increasing strength. In the first stage the opponent is a greedy player that follows the historical best policy network; in the second stage both players use Gumbel MCTS, which deepens learning and yields faster convergence and stronger trajectories. The approach is validated on TSP and then extended to multi-constrained EVRP, where it outperforms state-of-the-art deep RL baselines on all instance types tested and outperforms the optimization solver on the large-scale instances tested. The work indicates that dynamic, stage-aware self-play can significantly improve performance on variable-step CO problems like EVRP, with practical implications for energy-efficient routing.

Abstract

Recently, Gumbel AlphaZero (GAZ) was proposed to solve classic combinatorial optimization (CO) problems such as TSP and JSSP by creating a carefully designed competition model (consisting of a learning player and a competitor player) that leverages the idea of self-play. However, if the competitor is too strong or too weak, the effectiveness of self-play training can be reduced, particularly in complex CO problems. To address this problem, we propose a two-stage self-play strategy to improve the GAZ method (named TSS GAZ PTP). In the first stage, the learning player uses the policy network enhanced by Gumbel Monte Carlo Tree Search (MCTS), and the competitor uses the historical best trained policy network (acting as a greedy player). In the second stage, we employ Gumbel MCTS for both players, which makes the competition fiercer so that both players can continuously learn smarter trajectories. We first evaluate TSS GAZ PTP on TSP, since the original GAZ also uses it as a test problem. The results show the superior performance of TSS GAZ PTP. We then extend TSS GAZ PTP to multi-constrained Electric Vehicle Routing Problems (EVRP), a real-world application that has recently attracted attention and remains challenging as a complex CO problem. The experimental results show that TSS GAZ PTP outperforms the state-of-the-art Deep Reinforcement Learning methods on all types of instances tested and outperforms the optimization solver on the large-scale instances tested, indicating the importance and promise of employing more dynamic self-play strategies for complex CO problems.
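
To make the two-stage competition concrete, the following is a minimal Python sketch on a toy TSP. It is not the authors' implementation, and it rests on several assumptions: `nearest_neighbour_prior` is a hypothetical stand-in for a trained policy network, `search_rollout` is a simple Gumbel-noise sampling stand-in rather than real Gumbel MCTS, and the episode outcome is assumed to be the sign of the tour-length difference between the two players. All function and parameter names are illustrative.

```python
import math
import random


def tour_length(dist, tour):
    """Length of the closed tour that visits nodes in the given order."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))


def nearest_neighbour_prior(dist, current, unvisited):
    """Hypothetical stand-in for a policy network: closer nodes get higher logits."""
    return {j: -dist[current][j] for j in unvisited}


def greedy_rollout(dist, prior):
    """Greedy player: always take the action with the highest logit."""
    tour, unvisited = [0], set(range(1, len(dist)))
    while unvisited:
        logits = prior(dist, tour[-1], unvisited)
        nxt = max(logits, key=logits.get)
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour


def search_rollout(dist, prior, rng, n_samples=16):
    """Stand-in for Gumbel MCTS (assumption, not the real search).

    Samples tours by perturbing logits with Gumbel noise and keeps the
    shortest one; the greedy tour is included as a baseline candidate.
    """
    best = greedy_rollout(dist, prior)
    for _ in range(n_samples):
        tour, unvisited = [0], set(range(1, len(dist)))
        while unvisited:
            logits = prior(dist, tour[-1], unvisited)
            # Gumbel(0, 1) noise: g = -log(-log(u)), with u kept away from 0.
            noisy = {j: v - math.log(-math.log(max(rng.random(), 1e-12)))
                     for j, v in logits.items()}
            nxt = max(noisy, key=noisy.get)
            tour.append(nxt)
            unvisited.remove(nxt)
        if tour_length(dist, tour) < tour_length(dist, best):
            best = tour
    return best


def play_episode(dist, learner_prior, best_prior, stage, rng):
    """Return +1 if the learner's tour is shorter, -1 if longer, 0 on a tie."""
    learner_tour = search_rollout(dist, learner_prior, rng)
    if stage == 1:
        # Stage 1: the competitor plays greedily with the historical best policy.
        competitor_tour = greedy_rollout(dist, best_prior)
    else:
        # Stage 2: the competitor also searches, so the match is harder to win.
        competitor_tour = search_rollout(dist, best_prior, rng)
    diff = tour_length(dist, competitor_tour) - tour_length(dist, learner_tour)
    return (diff > 0) - (diff < 0)
```

In this sketch the only difference between the stages is how the competitor selects actions, which mirrors the abstract's description: the learner always searches, while the competitor is greedy in stage 1 and searches in stage 2.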

Paper Structure

This paper contains 19 sections, 19 equations, 9 figures, 4 tables, 2 algorithms.

Figures (9)

  • Figure 1: The basic framework of the proposed Two-stage Self-play Gumbel AlphaZero (TSS GAZ PTP). The red part shows action selection by the learning player with Gumbel MCTS; the blue part shows action selection by the competitor player, which uses a greedy strategy in stage 1 and Gumbel MCTS in stage 2.
  • Figure 2: A simple example of multi-constrained EVRP path planning
  • Figure 3: The comparison of Gumbel MCTS for Stage 1 and 2
  • Figure 4: Vanilla Transformer block (left) and our Transformer block (right) that adds gate aggregation
  • Figure 5: Comparison of GAZ PTP, GAZ PTP (fine-tuned), and TSS GAZ PTP on EVRP category C10-S4. Our proposed method achieves the lowest SOC consumption. SOC consumption drops sharply after 20K episodes, indicating that our proposed method can escape local optima and achieve better performance.
  • ...and 4 more figures