SurGen: Text-Guided Diffusion Model for Surgical Video Generation

Joseph Cho, Samuel Schmidgall, Cyril Zakka, Mrudang Mathur, Dhamanpreet Kaur, Rohan Shad, William Hiesinger

TL;DR

SurGen tackles the need for realistic, text-controlled surgical video generation to enhance training. It leverages a CogVideoX-based diffusion framework with a 3D VAE, a 3D transformer, and a T5 text encoder to produce high-resolution ($720 \times 480$) videos of $49$ frames guided by phase-specific prompts. The approach demonstrates substantially improved FID and FVD scores and stronger phase alignment relative to real data and a baseline, indicating improved visual fidelity, temporal dynamics, and conceptual correctness. This work suggests diffusion-based surgical simulators can provide diverse, scalable educational content, with future directions including expanding datasets, adding kinematic conditioning, and pursuing real-time generation for fully immersive training.

Abstract

Diffusion-based video generation models have made significant strides, producing outputs with improved visual fidelity, temporal coherence, and user control. These advancements hold great promise for improving surgical education by enabling more realistic, diverse, and interactive simulation environments. In this study, we introduce SurGen, a text-guided diffusion model tailored for surgical video synthesis. SurGen produces videos with the highest resolution and longest duration among existing surgical video generation models. We validate the visual and temporal quality of the outputs using standard image and video generation metrics. Additionally, we assess their alignment to the corresponding text prompts through a deep learning classifier trained on surgical data. Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.

Paper Structure (12 sections, 4 figures)

Figures (4)

  • Figure 1: Series of videos generated by SurGen, a 2-billion-parameter text-guided diffusion model adapted to laparoscopic cholecystectomy procedures. The text prompts for the corresponding videos were formatted as "Laparoscopic cholecystectomy during {surgical phase}".
  • Figure 2: The text-to-video process of SurGen, our large video LDM adapted from CogVideoX. The text prompt is processed by a T5 text encoder to create a semantic representation. The diffusion transformer takes in Gaussian noise, and uses the text encoding to help guide the denoising process. The resulting denoised output is then decoded by the 3D VAE into a high-quality surgical video.
  • Figure 3: High-resolution frames from the surgical videos synthesized by SurGen. The model generates videos at 720 × 480 pixels (width × height).
  • Figure 4: Evaluation results of a 3D ResNet18, trained on the last 40 videos of Cholec80, show higher top-1 accuracy and AUROC in classifying surgical phases for SurGen-synthesized videos compared to the first 40 videos of Cholec80 (used to train SurGen).
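
The prompt format quoted in the Figure 1 caption can be illustrated with a minimal Python sketch. Only the template string comes from the caption. The helper names (`build_prompt`, `build_prompts`) are hypothetical, and the phase list is an assumption based on the seven phase labels commonly used for Cholec80; the exact phase wording used for SurGen's prompts is not given here.

```python
# Template taken from the Figure 1 caption; the placeholder is renamed to a
# valid Python identifier (surgical_phase) for use with str.format.
PROMPT_TEMPLATE = "Laparoscopic cholecystectomy during {surgical_phase}"

# ASSUMPTION: phase wording follows the seven Cholec80 phase labels.
# The wording actually used to train or prompt SurGen may differ.
ASSUMED_PHASES = [
    "preparation",
    "calot triangle dissection",
    "clipping and cutting",
    "gallbladder dissection",
    "gallbladder packaging",
    "cleaning and coagulation",
    "gallbladder retraction",
]


def build_prompt(surgical_phase: str) -> str:
    """Fill the caption's template with a single phase name."""
    return PROMPT_TEMPLATE.format(surgical_phase=surgical_phase.strip())


def build_prompts(phases: list[str]) -> list[str]:
    """Build one prompt per phase, preserving the input order."""
    return [build_prompt(phase) for phase in phases]


if __name__ == "__main__":
    for prompt in build_prompts(ASSUMED_PHASES):
        print(prompt)
```

Each resulting string would be passed to the text encoder as the conditioning prompt for one generated video; how the prompts are batched or sampled is not specified in the captions above.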