Multi-Agent Reinforcement Learning for Intraday Operating Rooms Scheduling under Uncertainty
Kailiang Liu, Ying Chen, Ralf Borndörfer, Thorsten Koch
TL;DR
The paper tackles intraday operating-room scheduling under uncertainty by modeling the system as a cooperative Markov game where each OR is an agent. It introduces a centralized-training, decentralized-execution MARL framework with a shared PPO policy and a within-epoch sequential assignment that guarantees conflict-free joint actions. The reward unifies throughput, timeliness, and overtime, incorporating setup times and a pre-day MIP reference schedule, and is optimized in a realistic six-OR, eight-surgery-type simulator. Empirical results show that MARL outperforms multiple heuristics and approaches the ex post MIP oracle, with interpretable patterns such as batching, length-aware sequencing, and adaptive prioritization. The work provides theoretical insights on CTDE optimality and suboptimality bounds and discusses practical deployment considerations and future extensions to heterogeneous resources and constrained objectives.
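The within-epoch sequential assignment can be illustrated with a minimal sketch. It is not the paper's implementation: the function `sequential_assign`, the `score` callable that stands in for the shared policy, and the priority values are all hypothetical, and greedy argmax selection is an assumption made for illustration. The point is only that removing each chosen case from the candidate pool before the next OR decides makes the joint action conflict-free by construction.

```python
from typing import Callable, Dict, List, Optional


def sequential_assign(
    idle_ors: List[int],
    waiting_cases: List[str],
    score: Callable[[int, str], float],
) -> Dict[int, Optional[str]]:
    """Assign at most one waiting case to each idle OR, one OR at a time.

    `score` is a hypothetical stand-in for the shared policy: it rates how
    attractive a case is for a given OR. Each chosen case is removed from
    the pool, so no two ORs can select the same case.
    """
    remaining = list(waiting_cases)  # copy so the caller's list is untouched
    assignment: Dict[int, Optional[str]] = {}
    for room in idle_ors:
        if not remaining:
            assignment[room] = None  # nothing left to assign; OR stays idle
            continue
        best = max(remaining, key=lambda case: score(room, case))
        assignment[room] = best
        remaining.remove(best)
    return assignment


# Illustrative priorities (assumed values, not taken from the paper).
priority = {"emergency": 3.0, "urgent": 2.0, "elective": 1.0}
result = sequential_assign(
    idle_ors=[0, 1, 2],
    waiting_cases=["elective", "emergency"],
    score=lambda room, case: priority[case],
)
print(result)  # {0: 'emergency', 1: 'elective', 2: None}
```

In this toy run the first OR takes the emergency, the second takes the remaining elective, and the third stays idle because the pool is empty.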
Abstract
Intraday surgical scheduling is a multi-objective decision problem under uncertainty, balancing elective throughput, urgent and emergency demand, delays, sequence-dependent setups, and overtime. We formulate the problem as a cooperative Markov game and propose a multi-agent reinforcement learning (MARL) framework in which each operating room (OR) is an agent trained with centralized training and decentralized execution. All agents share a policy trained via Proximal Policy Optimization (PPO), which maps rich system states to actions, while a within-epoch sequential assignment protocol constructs conflict-free joint schedules across ORs. A mixed-integer pre-schedule provides reference starting times for electives; we impose type-specific quadratic delay penalties relative to these references and a terminal overtime penalty, yielding a single reward that captures throughput, timeliness, and staff workload. In simulations reflecting a realistic hospital mix (six ORs, eight surgery types, random urgent and emergency arrivals), the learned policy outperforms six rule-based heuristics across seven metrics and three evaluation subsets; we also quantify its optimality gap relative to an ex post MIP oracle. Policy analytics reveal interpretable behavior: prioritizing emergencies, batching similar cases to reduce setups, and deferring lower-value electives. We also derive a suboptimality bound for the sequential decomposition under simplifying assumptions. We discuss limitations, including OR homogeneity and the omission of explicit staffing constraints, and outline extensions. Overall, the approach offers a practical, interpretable, and tunable data-driven complement to optimization for real-time OR scheduling.
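To make the reward structure concrete, the following sketch combines a per-case completion value, a type-specific quadratic delay penalty measured against the reference start time, and a terminal overtime penalty. It is a sketch under stated assumptions rather than the paper's formulation: the names `Case`, `delay_penalty`, and `episode_reward`, all numeric weights, the one-sided treatment of delay (early starts are not penalized), and the linear form of the overtime term are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Case:
    surgery_type: str
    reference_start: float  # minutes; from the pre-day reference schedule
    actual_start: float     # minutes; realized start time


def delay_penalty(case: Case, weights: Dict[str, float]) -> float:
    """Type-specific quadratic penalty on lateness (assumed one-sided)."""
    delay = max(0.0, case.actual_start - case.reference_start)
    return weights[case.surgery_type] * delay ** 2


def episode_reward(
    cases: List[Case],
    completion_value: Dict[str, float],
    delay_weights: Dict[str, float],
    overtime_minutes: float,
    overtime_weight: float,
) -> float:
    """Throughput value minus delay penalties minus a terminal overtime term."""
    throughput = sum(completion_value[c.surgery_type] for c in cases)
    delays = sum(delay_penalty(c, delay_weights) for c in cases)
    # Overtime is charged once at the end of the day (linear form assumed).
    return throughput - delays - overtime_weight * overtime_minutes


# Illustrative numbers only; none of these values come from the paper.
cases = [
    Case("hip", reference_start=60.0, actual_start=90.0),        # 30 min late
    Case("cataract", reference_start=120.0, actual_start=110.0),  # early
]
reward = episode_reward(
    cases,
    completion_value={"hip": 20.0, "cataract": 10.0},
    delay_weights={"hip": 0.01, "cataract": 0.05},
    overtime_minutes=15.0,
    overtime_weight=0.5,
)
print(reward)
```

Because the penalty grows with the square of the delay, a single long delay costs more than several short ones of the same total length, which is one way a scalar reward can encode timeliness alongside throughput and overtime.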
