Interactive Generation of Laparoscopic Videos with Diffusion Models

Ivan Iliash, Simeon Allmendinger, Felix Meissen, Niklas Kühl, Daniel Rückert

TL;DR

This work introduces a diffusion-model-based framework for interactive laparoscopic video generation, conditioned on text prompts and on tool positions given as segmentation masks. A three-stage pipeline (StableDiffusion finetuning, ControlNet conditioning on tool masks, and ControlVideo cross-frame inference) produces realistic frames with controllable tool placement on the CholecT50/Cholec80 datasets, reporting an FID of 38.097 and a pixel-wise F1-score of 0.71. The approach is evaluated with several fidelity and factual-correctness metrics, including RDV action recognition, which indicate improved realism and action coherence; qualitative results show coherent surgical sequences. Limitations include a remaining gap to real videos and room for improvement in 3D depth and camera motion, which would further enhance the method's effectiveness for surgical education.
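
For readers unfamiliar with the FID figure quoted above, the standard definition of the Fréchet Inception Distance is reproduced here. This is the textbook formula, not something taken from this summary: the symbols are assumptions of the usual convention, with $\mu_r, \Sigma_r$ the mean and covariance of feature embeddings of real images and $\mu_g, \Sigma_g$ those of generated images. Which feature extractor and sample sizes the authors used is not stated here.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values mean the two feature distributions are closer, so a lower FID indicates generated images that are statistically more similar to real ones.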

Abstract

Generative AI in general, and synthetic visual data generation in particular, hold much promise for benefiting surgical training by providing photorealism to simulation environments. Current training methods primarily rely on reading materials and observing live surgeries, which can be time-consuming and impractical. In this work, we take a significant step towards improving the training process. Specifically, we use diffusion models in combination with a zero-shot video diffusion method to interactively generate realistic laparoscopic images and videos by specifying a surgical action through text and guiding the generation with tool positions through segmentation masks. We demonstrate the performance of our approach using the publicly available Cholec dataset family and evaluate the fidelity and factual correctness of our generated images using a surgical action recognition model as well as the pixel-wise F1-score for the spatial control of tool generation. We achieve an FID of 38.097 and an F1-score of 0.71.
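
As an illustration of the pixel-wise F1-score mentioned above, the following is a minimal sketch of how such a score can be computed between the tool mask used as conditioning and a tool mask recovered from a generated image. The function name `pixelwise_f1`, the toy masks, and the convention for two empty masks are assumptions made for this example; how the authors obtain the mask from a generated image is not described in this summary and is not shown here.

```python
import numpy as np


def pixelwise_f1(pred_mask, true_mask):
    """Pixel-wise F1 between two binary masks of equal shape (hypothetical helper)."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    if pred.shape != true.shape:
        raise ValueError("masks must have the same shape")

    tp = np.logical_and(pred, true).sum()    # tool pixels present in both masks
    fp = np.logical_and(pred, ~true).sum()   # predicted tool pixels outside the target
    fn = np.logical_and(~pred, true).sum()   # target tool pixels that were missed

    denom = 2 * tp + fp + fn
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if denom == 0 else float(2 * tp / denom)


# Toy example: the conditioning mask covers 4 pixels, the recovered mask 6 pixels,
# and all 4 target pixels are hit (tp=4, fp=2, fn=0).
true_mask = np.zeros((4, 4), dtype=bool)
true_mask[1:3, 1:3] = True
pred_mask = np.zeros((4, 4), dtype=bool)
pred_mask[1:3, 1:4] = True

print(pixelwise_f1(pred_mask, true_mask))  # 0.8
```

A score of 1.0 means the generated tool occupies exactly the requested pixels. Averaging this value over many image and mask pairs gives a single dataset-level number comparable in spirit to the 0.71 reported above, although the exact averaging scheme used by the authors is not specified here.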

Paper Structure (13 sections, 4 figures, 1 table)

This paper contains 13 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Overview of our approach, consisting of three stages: StableDiffusion finetuning to adapt the text-to-image model to the laparoscopic domain, ControlNet training to add conditioning on tool segmentation masks, and ControlVideo inference for controlled video generation.
  • Figure 2: Exemplary prompts used to condition the text-to-image model. The formats from the first three examples are only available in the CholecT50 dataset, as verb and target labels are missing in the Cholec80 dataset.
  • Figure 3: Actual images from the CholecT50 dataset (a) vs. images generated by our finetuned StableDiffusion model (b).
  • Figure 4: Generated images/video frames for the given text prompts and tool masks, ordered in time from left to right. First row: tool conditioning (input); second row: ControlNet output; third row: ControlVideo output.