
Operating Room Workflow Analysis via Reasoning Segmentation over Digital Twins

Yiqing Shen, Chenjia Li, Bohan Liu, Cheng-Yi Li, Tito Porras, Mathias Unberath

TL;DR

This work addresses the need for flexible, open-set operating room workflow analysis beyond closed-set, end-to-end models. It introduces ORDiRS, a tuning-free framework that combines a structured OR digital twin, which preserves semantic and spatial relationships, with a three-stage LLM-driven reasoning pipeline (reason-retrieve-synthesize), plus ORDiRS-Agent for query-driven analysis. On the in-house and MOVR-Reason datasets, ORDiRS improves cIoU by 6.12%-9.74% over prior methods and also improves gIoU, albeit with higher inference time, which makes it better suited to offline workflow analysis. The approach decouples perception from reasoning, enabling cross-site analysis without continual fine-tuning and offering potential extensions to temporal pattern mining and other healthcare contexts.

Abstract

Analyzing operating room (OR) workflows to derive quantitative insights into OR efficiency is important for hospitals to maximize patient care and financial sustainability. Prior work on OR-level workflow analysis has relied on end-to-end deep neural networks. While these approaches work well in constrained settings, they are limited to the conditions specified at development time and cannot accommodate the workflow analysis needs of different OR scenarios (e.g., large academic center vs. rural provider) without data collection, annotation, and retraining. Reasoning segmentation (RS) based on foundation models offers this flexibility by enabling automated analysis of OR workflows from OR video feeds given only an implicit text query related to the objects of interest. Because they rely on large language model (LLM) fine-tuning, current RS approaches struggle to reason about semantic and spatial relationships and generalize poorly to OR video, where visual characteristics and domain-specific terminology vary. To address these limitations, we first propose a novel digital twin (DT) representation that preserves both semantic and spatial relationships between the various OR components. Then, building on this foundation, we propose ORDiRS (Operating Room Digital twin representation for Reasoning Segmentation), an LLM-tuning-free RS framework that reformulates RS into a "reason-retrieve-synthesize" paradigm. Finally, we present ORDiRS-Agent, an LLM-based agent that decomposes OR workflow analysis queries into manageable RS sub-queries and generates responses by combining detailed textual explanations with supporting visual evidence from RS. Experimental results on both an in-house and a public OR dataset demonstrate that ORDiRS achieves a cIoU improvement of 6.12%-9.74% over existing state-of-the-art methods.
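The abstract reports results in cIoU, and the TL;DR also mentions gIoU. Neither metric is defined in this summary, so the sketch below assumes the definitions commonly used in the reasoning segmentation literature: gIoU as the mean of per-sample IoU, and cIoU as the cumulative intersection divided by the cumulative union over all samples. The function names, the nested-list mask format, and the handling of empty masks are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Sequence, Tuple

# Assumption: a binary mask is a sequence of rows containing 0/1 values.
Mask = Sequence[Sequence[int]]


def intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersecting and united foreground pixels of two same-sized masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def g_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Mean of per-sample IoU (every sample weighs the same)."""
    scores = []
    for pred, gt in zip(preds, gts):
        inter, union = intersection_union(pred, gt)
        # Assumption: two empty masks count as a perfect match.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


def c_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Cumulative intersection over cumulative union (large objects weigh more)."""
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# Toy example with two 2x2 samples.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
print(g_iou(preds, gts))  # per-sample IoUs are 0.5 and 1.0
print(c_iou(preds, gts))  # 2 intersecting pixels over 3 united pixels
```

Under these definitions cIoU is dominated by samples with large masks, while gIoU treats every sample equally, which is why the two numbers can differ on the same predictions.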

Paper Structure

This paper contains 11 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of the ORDiRS framework. The pipeline consists of two main components. (1) DT representation construction: Processing raw OR video frames through multiple vision foundation models, culminating in a structured JSON; (2) RS stage: Implementing a three-stage "reason-retrieve-synthesize" paradigm where LLM-based reasoning decomposes an implicit text query into atomic reasoning requirements.
  • Figure 2: Visualization of the in-house OR reasoning segmentation benchmark dataset. (a) Sample video frame sequence from case ID 22 showing paired video frames, segmentation masks, and the corresponding spatial/semantic implicit text queries. (b) Query type distribution per dataset split, showing balanced representation. (c) Proportions of the train, validation, and test splits. (d) Overall equal distribution between spatial and semantic queries.
  • Figure 3: Qualitative comparison of RS results. Two representative cases are shown: a semantic RS task (top row) requiring identification of a patient with specific clothing attributes, and a spatial RS task (bottom row) involving positional understanding. White regions represent the segmentation masks, with ground truth shown in the leftmost column.
  • Figure 4: Case study of the ORDiRS-Agent workflow for analyzing operating room efficiency. The process begins with a user query about surgical phase transitions, followed by the identification of key efficiency aspects. The agent then generates targeted reasoning segmentation sub-queries (Step 2), performs reasoning segmentation using ORDiRS (Step 3), and concludes with result analysis (Step 4). The visualization shows how ORDiRS-Agent tracks critical workflow events across frames, including staff arrival (Frame 17), patient positioning (Frame 41), transition to transfer (Frame 273), and closure phase initiation (Frame 325).
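
The Figure 1 caption describes a structured digital twin and a "reason-retrieve-synthesize" paradigm. The sketch below is one minimal way to picture that flow under loud assumptions: the schema (`Entity`, `Relation`, `DTFrame`), the `Requirement` type, and all function names are hypothetical and are not taken from the paper. In particular, `reason` is a hard-coded stand-in for the LLM step that decomposes an implicit query into atomic requirements.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

Pixel = Tuple[int, int]


@dataclass
class Entity:
    entity_id: str
    category: str
    attributes: Dict[str, str]
    mask: Set[Pixel]  # assumption: mask stored as a set of pixel coordinates


@dataclass
class Relation:
    subject: str    # entity_id of the subject
    predicate: str  # e.g. "next_to"
    obj: str        # entity_id of the object


@dataclass
class DTFrame:
    entities: List[Entity]
    relations: List[Relation] = field(default_factory=list)


@dataclass
class Requirement:
    category: Optional[str] = None
    attributes: Dict[str, str] = field(default_factory=dict)
    relation: Optional[Tuple[str, str]] = None  # (predicate, target category)


def reason(query: str) -> List[Requirement]:
    """Stand-in for the LLM step: a fixed lookup instead of a model call."""
    table = {
        "who is standing next to the operating table?": [
            Requirement(category="staff", relation=("next_to", "operating_table"))
        ],
    }
    return table.get(query.lower(), [])


def matches(frame: DTFrame, entity: Entity, req: Requirement) -> bool:
    """Check one entity against one atomic requirement."""
    if req.category is not None and entity.category != req.category:
        return False
    if any(entity.attributes.get(k) != v for k, v in req.attributes.items()):
        return False
    if req.relation is not None:
        predicate, target_category = req.relation
        by_id = {e.entity_id: e for e in frame.entities}
        return any(
            r.subject == entity.entity_id
            and r.predicate == predicate
            and r.obj in by_id
            and by_id[r.obj].category == target_category
            for r in frame.relations
        )
    return True


def retrieve(frame: DTFrame, reqs: List[Requirement]) -> List[Entity]:
    """Keep the entities that satisfy every requirement."""
    if not reqs:
        return []
    return [e for e in frame.entities if all(matches(frame, e, r) for r in reqs)]


def synthesize(entities: List[Entity]) -> Set[Pixel]:
    """Merge the masks of the retrieved entities into one output mask."""
    merged: Set[Pixel] = set()
    for e in entities:
        merged |= e.mask
    return merged


frame = DTFrame(
    entities=[
        Entity("table_1", "operating_table", {}, {(5, 5)}),
        Entity("staff_1", "staff", {"gown": "blue"}, {(0, 0), (0, 1)}),
        Entity("staff_2", "staff", {"gown": "green"}, {(9, 9)}),
    ],
    relations=[Relation("staff_1", "next_to", "table_1")],
)
query = "Who is standing next to the operating table?"
result_mask = synthesize(retrieve(frame, reason(query)))
print(sorted(result_mask))
```

The point of the sketch is the separation of concerns: perception fills the digital twin once, and each query only touches the structured representation, so changing the query requires no retraining of the perception models.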