Reinforcement Learning as an Improvement Heuristic for Real-World Production Scheduling

Arthur Müller; Lukas Vollenkemper

TL;DR

The paper tackles a real-world permutation flow shop scheduling problem with two objectives. The first is to minimize tardiness via an exponential penalty $f_1(\sigma)=\sum_{i=1}^N e^{T_T(\sigma,i)}$, where $T_T(\sigma,i)=C_i-d_{\sigma(i)}$ and $C_i=T_W(W+i-1)$. The second is to improve worker well-being by maximizing $f_2(\sigma)=\sum_{w=1}^W\sum_{i=1}^{N-1}|p^{\sigma(i)}_w-p^{\sigma(i+1)}_w|$, which rewards alternation between long and short processing times at each workstation. An RL agent is trained with PPO as an improvement heuristic: it starts from a due-date-sorted permutation $\sigma_0$ and iteratively performs pairwise swaps chosen by a Transformer-based policy. The reward is a normalized combined objective $f_c$ that balances tardiness and stress, and the policy decodes a swap pair from a learned probability matrix. Experiments on real automotive data show that the proposed approach (notably RL-MPMR, which ensembles multiple policies) substantially outperforms simulated annealing and simple heuristics on both the training and test sets, with a notable generalization gap likely due to limited training data. The work demonstrates practical potential for RL-guided improvement in production scheduling and outlines future work to scale to multiple lines, larger permutations, additional operators, and synthetic data to improve generalization.
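The two objectives can be evaluated directly from their definitions. The following is a minimal sketch, not the authors' implementation. It assumes that $T_W$ is a fixed cycle time, $W$ the number of workstations, `due[j]` the due date $d_j$ and `proc[j][w]` the processing time $p^j_w$; the function and parameter names are hypothetical, and jobs are indexed from 0.

```python
import math


def tardiness_penalty(sigma, due, t_w, n_stations):
    """f_1: sum over positions i (1-based) of exp(C_i - d_{sigma(i)}),
    with C_i = T_W * (W + i - 1)."""
    total = 0.0
    for i, job in enumerate(sigma, start=1):
        completion = t_w * (n_stations + i - 1)  # C_i (assumed: t_w is the cycle time)
        total += math.exp(completion - due[job])
    return total


def alternation(sigma, proc):
    """f_2: sum over workstations and consecutive positions of
    |p_w^{sigma(i)} - p_w^{sigma(i+1)}|."""
    n_stations = len(proc[sigma[0]])
    return sum(
        abs(proc[a][w] - proc[b][w])
        for a, b in zip(sigma, sigma[1:])
        for w in range(n_stations)
    )


# Toy data (made up for illustration): 3 jobs, 2 workstations.
due = [2.0, 3.0, 4.0]
proc = [[1, 3], [3, 1], [1, 3]]
f1 = tardiness_penalty([0, 1, 2], due, t_w=1.0, n_stations=2)
f2_alternating = alternation([0, 1, 2], proc)
f2_grouped = alternation([0, 2, 1], proc)
```

In this toy instance every job finishes exactly on its due date, so each term of $f_1$ equals $e^0$, and the sequence that alternates long and short operations scores higher on $f_2$ than the one that groups similar jobs.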

Abstract

The integration of Reinforcement Learning (RL) with heuristic methods is an emerging trend for solving optimization problems, which leverages RL's ability to learn from the data generated during the search process. One promising approach is to train an RL agent as an improvement heuristic, starting with a suboptimal solution that is iteratively improved by applying small changes. We apply this approach to a real-world multiobjective production scheduling problem. Our approach utilizes a network architecture that includes Transformer encoding to learn the relationships between jobs. Afterwards, a probability matrix is generated from which pairs of jobs are sampled and then swapped to improve the solution. We benchmarked our approach against other heuristics using real data from our industry partner, demonstrating its superior performance.
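The improvement loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's method: the trained Transformer policy is replaced by a hypothetical `uniform_policy` placeholder that returns an equal-weight matrix, the objective is an arbitrary callable to be minimized, and the best permutation seen so far is tracked. All names are hypothetical.

```python
import random


def sample_pair(prob, rng):
    """Sample an index pair (i, j), i != j, with probability proportional to prob[i][j]."""
    n = len(prob)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    weights = [prob[i][j] for i, j in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]


def uniform_policy(sigma):
    """Placeholder for a learned policy: equal weight for every off-diagonal pair."""
    n = len(sigma)
    return [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]


def improve(sigma0, objective, policy, steps, seed=0):
    """Apply `steps` sampled swaps to sigma0 and return the best permutation found."""
    rng = random.Random(seed)
    sigma = list(sigma0)
    best, best_val = list(sigma), objective(sigma)
    for _ in range(steps):
        i, j = sample_pair(policy(sigma), rng)
        sigma[i], sigma[j] = sigma[j], sigma[i]  # swap the two selected jobs
        val = objective(sigma)
        if val < best_val:
            best, best_val = list(sigma), val
    return best, best_val


# Toy usage: start from the due-date-sorted permutation of five jobs.
due = [5.0, 1.0, 4.0, 2.0, 3.0]
sigma0 = sorted(range(len(due)), key=lambda j: due[j])
toy_objective = lambda s: sum(abs(job - pos) for pos, job in enumerate(s))
best, best_val = improve(sigma0, toy_objective, uniform_policy, steps=50)
```

Because the loop keeps the best permutation seen so far, the returned value is never worse than that of the starting solution, regardless of how good the policy is.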

Paper Structure (21 sections, 20 equations, 4 figures, 1 table)

Introduction
Problem Formulation
Method
Improvement Heuristic
RL Formulation
state
action
reward
policy
Network Architecture
Job Embedding
Job Pair Selection
Critic Network
Job Features
Experiment
...and 6 more sections

Figures (4)

Figure 1: Simplified illustration of the production line. In this example, seat $j_2$ is not completed by the due date. In addition, there is stress for the employee at workstation 3, as the consecutive seats $j_2$ to $j_4$ at this station have long processing times. The sequence of seats could be improved by swapping seats $j_2$ and $j_1$, as $j_1$ has a low processing time at workstation 3 and the tardiness of $j_2$ would be reduced.
Figure 2: Illustration of the $\text{swap}$-operator. $\text{swap}(\sigma_t, (3, 6))$ exchanges the positions of the third and sixth jobs in the permutation $\sigma_t$.
Figure 3: Architecture of the network used, consisting of a policy and critic network with shared parameters. The network employs Transformer encoding to process the features of the permutation $\sigma$. The policy network calculates a probability matrix, from which the next job pair to be swapped is selected ($\pi_\theta(a|s)$). The critic network estimates the remaining cumulative reward $v_\theta(s)$, which is used as information during training by the PPO algorithm. During inference, only the policy network is utilized.
Figure 4: The heatmap shows the buffer time $p^j_{w}-T_\text{W}$ of all operations in an example permutation of the training data. On the left, the permutation is sorted by due date. On the right, the permutation is shown after 10 swaps by the RL agent. Green indicates more buffer time, red less. In the right permutation, long and short operations alternate more often.
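The swap operator from Figure 2 can be written as a small pure function. This is a minimal sketch that assumes 1-based positions, as in the caption; the job labels `j1` to `j6` are made up for illustration.

```python
def swap(sigma, pair):
    """Return a copy of sigma with the jobs at 1-based positions (i, j) exchanged."""
    i, j = pair
    n = len(sigma)
    if not (1 <= i <= n and 1 <= j <= n):
        raise IndexError("positions must lie in 1..len(sigma)")
    out = list(sigma)
    out[i - 1], out[j - 1] = out[j - 1], out[i - 1]
    return out


sigma_t = ["j1", "j2", "j3", "j4", "j5", "j6"]
sigma_next = swap(sigma_t, (3, 6))
```

Applying the same swap twice restores the original permutation, and the input list is left unchanged.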
