Deep Reinforcement Learning Guided Improvement Heuristic for Job Shop Scheduling
Cong Zhang, Zhiguang Cao, Wen Song, Yaoxin Wu, Jie Zhang
TL;DR
This work introduces a DRL-guided improvement heuristic for Job Shop Scheduling that encodes complete solutions as disjunctive graphs and employs a two-module GNN (TPM and CAM) to capture topological and contextual information during search. The policy is trained with a novel n-step REINFORCE algorithm, and a batch-oriented message-passing evaluator computes the quality of many neighbor solutions at once; the computational complexity of the method scales linearly with problem size. Empirical results on seven benchmarks show the method consistently outperforms state-of-the-art DRL baselines and hand-crafted rules, and even surpasses CP-SAT on very large instances within practical time budgets. The approach narrows the gap to optimality, generalizes to longer improvement horizons than those seen in training, and offers a practical, scalable DRL framework for scheduling in manufacturing settings.
Abstract
Recent studies on using deep reinforcement learning (DRL) to solve job-shop scheduling problems (JSSP) focus on construction heuristics. However, their performance is still far from optimal, mainly because the underlying graph representation scheme is unsuitable for modelling the partial solutions that arise at each construction step. This paper proposes a novel DRL-guided improvement heuristic for solving JSSP, in which a graph representation is employed to encode complete solutions. We design a representation scheme based on graph neural networks, consisting of two modules that effectively capture the dynamic topology and the different types of nodes in the graphs encountered during the improvement process. To speed up solution evaluation during improvement, we present a novel message-passing mechanism that can evaluate multiple solutions simultaneously. We prove that the computational complexity of our method scales linearly with problem size. Experiments on classic benchmarks show that the improvement policy learned by our method outperforms state-of-the-art DRL-based methods by a large margin.
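
To illustrate the idea of evaluating several complete solutions at once, the sketch below computes makespans for a batch of disjunctive graphs by repeated message passing: each operation's earliest start time is the maximum, over its predecessors, of the predecessor's start time plus its processing time. This is a minimal NumPy illustration under our own assumptions, not the paper's implementation. The function name `batch_makespan`, the dense adjacency encoding, the omission of dummy source and sink nodes, and the toy 2-job, 2-machine instance are all hypothetical.

```python
import numpy as np

def batch_makespan(adj, proc):
    """Makespans of B complete schedules (illustrative sketch only).

    adj  : (B, N, N) 0/1 array; adj[b, u, v] = 1 if solution b has an
           arc u -> v (job precedence or chosen machine ordering).
    proc : (N,) non-negative processing times, shared by all solutions.
    Each graph is assumed to be acyclic.
    """
    B, N, _ = adj.shape
    mask = adj.astype(bool)
    est = np.zeros((B, N))  # earliest start time of every operation
    for _ in range(N):  # a longest path has at most N - 1 arcs
        # Message from u to v: completion time of u, or 0 if no arc u -> v.
        msg = np.where(mask, (est + proc)[:, :, None], 0.0)
        new_est = msg.max(axis=1)  # aggregate over predecessors u
        if np.array_equal(new_est, est):
            break  # every solution in the batch has converged
        est = new_est
    return (est + proc).max(axis=1)

# Toy instance (hypothetical): job 0 = ops 0 -> 1, job 1 = ops 2 -> 3.
# Ops 0 and 3 share machine 0; ops 1 and 2 share machine 1.
proc = np.array([3.0, 2.0, 2.0, 4.0])
job_arcs = [(0, 1), (2, 3)]
solution_a = job_arcs + [(0, 3), (2, 1)]  # machine orders: 0 then 3, 2 then 1
solution_b = job_arcs + [(0, 3), (1, 2)]  # machine orders: 0 then 3, 1 then 2

adj = np.zeros((2, 4, 4))
for b, arcs in enumerate([solution_a, solution_b]):
    for u, v in arcs:
        adj[b, u, v] = 1.0

makespans = batch_makespan(adj, proc)
print(makespans)  # [ 7. 11.]
```

Both schedules are evaluated in the same sequence of array operations, which is the property a batched evaluator exploits when scoring many neighbors of the current solution. The dense `(B, N, N)` encoding is chosen here only for brevity; it does not reflect the cost of a sparse message-passing implementation.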
