Integrating Human Knowledge Through Action Masking in Reinforcement Learning for Operations Research

Mirko Stappert, Bernhard Lutz, Niklas Goby, Dirk Neumann

TL;DR

This work examines how to integrate human knowledge into reinforcement learning for operations research by using action masking to constrain or guide policy learning. By formalizing action masks and their combinations, and applying them to paint shop scheduling, peak load management, and inventory management, the study demonstrates substantial performance gains and faster learning when human heuristics or provably optimal actions are enforced. However, it also shows risks: overly restrictive masks can hinder exploration and degrade performance in some settings. The results suggest action masking as a practical tool to improve trust, safety, and adoption of RL in real-world OR problems, with potential for post-hoc adjustments and hybrid approaches with reward shaping.

Abstract

Reinforcement learning (RL) provides a powerful method to address problems in operations research. However, its real-world application often fails due to a lack of user acceptance and trust. A possible remedy is to let managers alter the RL policy by incorporating human expert knowledge. In this study, we analyze the benefits and caveats of including human knowledge via action masking. While action masking has so far been used to exclude invalid actions, its ability to integrate human expertise remains underexplored. Human knowledge is often encapsulated in heuristics, which suggest reasonable, near-optimal actions in certain situations. Enforcing such actions should hence increase the workforce's trust in the model's decisions. Yet, strict enforcement of heuristic actions may also prevent the policy from exploring superior actions, thereby leading to lower overall performance. We analyze the effects of action masking on three problems with different characteristics, namely, paint shop scheduling, peak load management, and inventory management. Our findings demonstrate that incorporating human knowledge through action masking can achieve substantial improvements over policies trained without action masking. In addition, we find that action masking is crucial for learning effective policies in constrained action spaces, where certain actions can only be performed a limited number of times. Finally, we highlight the potential for suboptimal outcomes when action masks are overly restrictive.
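
As a rough illustration of the mechanism the abstract describes, the sketch below shows one common way to apply an action mask to a discrete policy: the logits of disallowed actions are set to negative infinity before the softmax, so those actions receive zero probability. The function names `masked_policy` and `combine_masks`, the example logits, and the fallback rule for an empty intersection are assumptions made for this illustration and are not taken from the paper.

```python
import numpy as np


def masked_policy(logits, mask):
    """Softmax over logits with masked-out actions forced to probability 0."""
    logits = np.asarray(logits, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask must allow at least one action")
    # Disallowed actions get -inf, so exp(-inf) = 0 after the softmax.
    masked = np.where(mask, logits, -np.inf)
    shifted = masked - masked.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def combine_masks(valid_mask, heuristic_mask):
    """Intersect a validity mask with a heuristic mask (hypothetical rule).

    Assumption: if the heuristic would forbid every valid action,
    fall back to the validity mask alone.
    """
    valid_mask = np.asarray(valid_mask, dtype=bool)
    heuristic_mask = np.asarray(heuristic_mask, dtype=bool)
    combined = valid_mask & heuristic_mask
    return combined if combined.any() else valid_mask


# Four actions: action 3 is invalid, and the heuristic rules out action 0.
logits = np.array([1.0, 2.0, 0.5, 0.0])
valid = np.array([True, True, True, False])
heuristic = np.array([False, True, True, True])

probs = masked_policy(logits, combine_masks(valid, heuristic))
print(probs)  # only actions 1 and 2 have non-zero probability
```

A restrictive heuristic mask shrinks the set of actions the policy can ever sample, which is the trade-off the abstract points out: the policy is steered toward trusted actions, but it cannot discover a better action that the mask excludes.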

Paper Structure

This paper contains 17 sections, 31 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Paint shop problem with a 4x5 buffer (four lanes of width five). The system retrieves from lane 4 without causing a color change.
  • Figure 2: Illustration of the four considered action masks.
  • Figure 3: Evaluation results (color changes) for all RL approaches and Greedy heuristic.
  • Figure 4: Learning curves for RL models with 10 colors and 4x4 buffer and varying action masks.
  • Figure 5: Load curve for load management system over 96 timesteps.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Definition: Markov decision process