Multi-Agent Reinforcement Learning for Intraday Operating Rooms Scheduling under Uncertainty

Kailiang Liu, Ying Chen, Ralf Borndörfer, Thorsten Koch

TL;DR

The paper tackles intraday operating-room scheduling under uncertainty by modeling the system as a cooperative Markov game where each OR is an agent. It introduces a centralized-training, decentralized-execution MARL framework with a shared PPO policy and a within-epoch sequential assignment that guarantees conflict-free joint actions. The reward unifies throughput, timeliness, and overtime, incorporating setup times and a pre-day MIP reference schedule, and is optimized in a realistic six-OR, eight-surgery-type simulator. Empirical results show that MARL outperforms multiple heuristics and approaches the ex post MIP oracle, with interpretable patterns such as batching, length-aware sequencing, and adaptive prioritization. The work provides theoretical insights on CTDE optimality and suboptimality bounds and discusses practical deployment considerations and future extensions to heterogeneous resources and constrained objectives.

Abstract

Intraday surgical scheduling is a multi-objective decision problem under uncertainty, balancing elective throughput, urgent and emergency demand, delays, sequence-dependent setups, and overtime. We formulate the problem as a cooperative Markov game and propose a multi-agent reinforcement learning (MARL) framework in which each operating room (OR) is an agent trained with centralized training and decentralized execution. All agents share a policy trained via Proximal Policy Optimization (PPO), which maps rich system states to actions, while a within-epoch sequential assignment protocol constructs conflict-free joint schedules across ORs. A mixed-integer pre-schedule provides reference starting times for electives; we impose type-specific quadratic delay penalties relative to these references and a terminal overtime penalty, yielding a single reward that captures throughput, timeliness, and staff workload. In simulations reflecting a realistic hospital mix (six ORs, eight surgery types, random urgent and emergency arrivals), the learned policy outperforms six rule-based heuristics across seven metrics and three evaluation subsets, and we quantify its optimality gaps relative to an ex post MIP oracle. Policy analytics reveal interpretable behavior: prioritizing emergencies, batching similar cases to reduce setups, and deferring lower-value electives. We also derive a suboptimality bound for the sequential decomposition under simplifying assumptions. We discuss limitations, including OR homogeneity and the omission of explicit staffing constraints, and outline extensions. Overall, the approach offers a practical, interpretable, and tunable data-driven complement to optimization for real-time OR scheduling.
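The within-epoch sequential assignment idea can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function `sequential_assignment`, the `choose` callback (a stand-in for the shared PPO policy), and the case labels are all hypothetical names introduced here. The sketch only shows why letting rooms pick one at a time from a shrinking candidate list cannot produce a conflict, since a case chosen by one room is removed before the next room decides.

```python
from typing import Callable, Dict, List, Optional, Sequence


def sequential_assignment(
    idle_rooms: Sequence[int],
    waiting_cases: Sequence[str],
    choose: Callable[[int, List[str]], Optional[str]],
) -> Dict[int, str]:
    """Assign at most one waiting case to each idle room, one room at a time.

    `choose` is a hypothetical stand-in for the shared policy: it receives a
    room id and the cases still available, and returns a case or None (idle).
    """
    remaining = list(waiting_cases)  # copy so the caller's list is untouched
    assignment: Dict[int, str] = {}
    for room in idle_rooms:
        if not remaining:
            break  # nothing left to assign in this epoch
        case = choose(room, remaining)
        if case is None:
            continue  # the room decides to stay idle
        if case not in remaining:
            raise ValueError(f"room {room} chose unavailable case {case!r}")
        remaining.remove(case)  # later rooms can no longer pick this case
        assignment[room] = case
    return assignment


def pick_first(room: int, candidates: List[str]) -> Optional[str]:
    """Toy decision rule (assumption): always take the first available case."""
    return candidates[0] if candidates else None


schedule = sequential_assignment([0, 1, 2], ["A", "B"], pick_first)
print(schedule)
```

With three idle rooms and two waiting cases, the first two rooms each receive a distinct case and the third stays idle; swapping `pick_first` for a learned policy would leave the conflict-free property unchanged, because it comes from the shrinking candidate list rather than from the decision rule.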

Paper Structure

This paper contains 57 sections, 4 theorems, 22 equations, 3 figures, 5 tables, 2 algorithms.

Key Result

Lemma 1

Fix an epoch $t$ and assume: (A1) no sequence-dependent setups ($\sigma_{k'\to k}\equiv 0$); (A2) starting a case at $t$ affects only its own queue (OR-separable transitions); (A3) immediate rewards add across ORs (no cross-OR penalties); and (A4) no contention for a single remaining patient in any

Figures (3)

  • Figure 1: Radar charts comparing the performance of alternative scheduling methods.
  • Figure 2: Representative Gantt charts illustrating learned scheduling behavior under different operating conditions: (a) a regular non-emergency day, (b) a day with an emergency batch arriving late ($t=66$), and (c) a day with an early emergency batch ($t=1$) occurring while surgeries are in progress, constraining the MARL policy and leading to longer delays.
  • Figure 3: Comparison of daily schedules produced by the MARL (top) and the ex post MIP oracle (bottom) under identical arrival sequences.

Theorems & Definitions (8)

  • Lemma 1: Equivalence under weak coupling
  • proof
  • Proposition 2: Optimality of CTDE under Weak Coupling
  • proof
  • Theorem 3: Suboptimality bound
  • proof
  • Corollary 4: Sequential policy gap for one urgent case
  • proof