Towards Controllable Video Synthesis of Routine and Rare OR Events

Dominik Schneider, Lalithkumar Seenivasan, Sampath Rapuri, Vishalroshan Anil, Aiza Maksutova, Yiqing Shen, Jan Emily Mangulabnan, Hao Ding, Jose L. Porras, Masaru Ishii, Mathias Unberath

TL;DR

This work presents an OR video diffusion framework that enables controlled synthesis of rare and safety-critical events from abstract geometric representations and shows its potential to support the development of ambient intelligence models.

Abstract

Purpose: Curating large-scale datasets of operating room (OR) workflow that encompass rare, safety-critical, or atypical events remains operationally and ethically challenging. This data bottleneck complicates the development of ambient intelligence for detecting, understanding, and mitigating rare or safety-critical events in the OR. Methods: This work presents an OR video diffusion framework that enables controlled synthesis of rare and safety-critical events. The framework integrates a geometric abstraction module, a conditioning module, and a fine-tuned diffusion model: it first transforms OR scenes into abstract geometric representations, then conditions the synthesis process on them, and finally generates realistic OR event videos. Using this framework, we also curate a synthetic dataset to train and validate AI models for detecting near-misses of sterile-field violations. Results: In synthesizing routine OR events, our method outperforms off-the-shelf video diffusion baselines, achieving lower FVD/LPIPS and higher SSIM/PSNR on both in-domain and out-of-domain datasets. Qualitative results illustrate its ability to synthesize counterfactual events in a controlled manner. An AI model trained and validated on the generated synthetic data achieved a recall of 70.13% in detecting near safety-critical events. Finally, we conduct an ablation study to quantify the performance gains from key design choices. Conclusion: Our solution enables controlled synthesis of routine and rare OR events from abstract geometric representations. Beyond demonstrating its capability to generate rare and safety-critical scenarios, we show its potential to support the development of ambient intelligence models.

Paper Structure (7 sections, 6 figures, 4 tables)

This paper contains 7 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The abstract, geometry-conditioned OR diffusion framework consists of three main modules: (i) The Geometric Abstraction Module converts the initial OR scene into an abstract geometric scene representation using ellipsoids. (ii) The Conditioning Module generates temporal sequences of abstract geometric scenes through one of two pathways: from routine OR events (blue dashed path) or from user-defined trajectories (dash-dotted green path). (iii) The Diffusion Module synthesizes videos of OR events conditioned on the initial scene and the geometric sequences.
  • Figure 2: Geometric abstraction module pipeline. Given an initial scene and segmentation point prompts, SAM2 [ravi2024sam] propagates instance segmentation masks across the video. Depth is estimated using Video Depth Anything [chen2025video]. Each segmented instance is then approximated by an ellipsoid parameterized by its centroid position and spatial spread (height, width, rotation angle). The resulting Abstract Geometric Scene Representation encodes class information in the red and green channels and normalized relative depth in the blue channel intensity.
  • Figure 3: Interactive conditioning module for counterfactual event generation. Given an input OR video sequence, the Geometric Abstraction Module converts the scene into an abstract geometric representation. A graphical user interface lets the user drag and drop the resulting ellipsoids to sketch desired trajectories. The Conditioning Module then transforms the original geometric sequence into a counterfactual event by incorporating the user-modified trajectories.
  • Figure 4: Qualitative comparison of video synthesis methods on the out-of-domain (4DOR) dataset. Groundtruth: Original video frames to reconstruct. WAN [wan2025wan] and LTX-Base: Text-conditioned generation using VLM descriptions of the groundtruth scene. SVD: Image-to-video generation with a low dynamic motion setting. Ours: Our proposed video synthesis using the abstract geometric representation.
  • Figure 5: Controllable synthesis of safety-critical, interaction, and alternate OR events. Each column pair shows a routine OR event (left) with its abstract geometric representation (top), and a counterfactual event (right) generated by providing a trajectory for geometric conditioning. Left pair (safety-critical event): A non-sterile assistant approaches the sterile instrument table. Middle pair (interaction): Personnel walk toward the table and reach out to interact with it. Right pair (alternate event): Modified trajectory in which personnel walk directly toward the patient bed instead of following the original path around the room.
  • ...and 1 more figures
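
The geometric abstraction described in the Figure 2 caption (a per-instance ellipse given by centroid, spread, and rotation, drawn with class information in the red and green channels and normalized relative depth in the blue channel) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The moment-based fit, the function names `fit_ellipse` and `render_ellipse`, the class color `(200, 50)`, and the use of the mean depth over the mask are all assumptions made for illustration; in the paper the masks come from SAM2 and the depth from Video Depth Anything, whereas here both are synthetic.

```python
import numpy as np


def fit_ellipse(mask):
    """Fit an ellipse to a binary mask using first and second image moments.

    Returns (cx, cy, semi_major, semi_minor, angle_rad). Hypothetical helper.
    """
    ys, xs = np.nonzero(mask)
    if xs.size < 2:
        raise ValueError("mask needs at least two foreground pixels")
    cx, cy = xs.mean(), ys.mean()
    cov = np.cov(np.stack([xs, ys]))        # 2x2 covariance of pixel coordinates
    evals, evecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    # A uniformly filled ellipse with semi-axis a has variance a^2 / 4 along it.
    semi_minor, semi_major = 2.0 * np.sqrt(np.maximum(evals, 0.0))
    major_vec = evecs[:, 1]                 # direction of largest spread
    angle = np.arctan2(major_vec[1], major_vec[0])
    return cx, cy, semi_major, semi_minor, angle


def render_ellipse(canvas, ellipse, class_rg, depth):
    """Draw a filled ellipse into an HxWx3 uint8 canvas.

    Red/green carry the (assumed) class color, blue carries depth in [0, 1].
    """
    cx, cy, a, b, theta = ellipse
    a, b = max(a, 1e-6), max(b, 1e-6)       # guard against degenerate masks
    h, w = canvas.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    dx, dy = xx - cx, yy - cy
    # Rotate pixel offsets into the ellipse's own axes.
    u = dx * np.cos(theta) + dy * np.sin(theta)
    v = -dx * np.sin(theta) + dy * np.cos(theta)
    inside = (u / a) ** 2 + (v / b) ** 2 <= 1.0
    canvas[inside] = (class_rg[0], class_rg[1], int(round(255 * depth)))
    return canvas


# Synthetic stand-ins for a segmentation mask and a normalized depth map.
H, W = 80, 100
yy, xx = np.mgrid[0:H, 0:W]
mask = ((xx - 50) / 20.0) ** 2 + ((yy - 40) / 10.0) ** 2 <= 1.0
depth_map = np.full((H, W), 0.4)

ellipse = fit_ellipse(mask)
instance_depth = float(depth_map[mask].mean())   # assumed per-instance aggregation
scene = render_ellipse(np.zeros((H, W, 3), np.uint8), ellipse, (200, 50), instance_depth)
```

Fitting from moments keeps the representation to five numbers per instance, which is what makes it easy to edit a trajectory by moving only the centroid from frame to frame, as the interactive conditioning in Figure 3 suggests.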