Interactive Generation of Laparoscopic Videos with Diffusion Models
Ivan Iliash, Simeon Allmendinger, Felix Meissen, Niklas Kühl, Daniel Rückert
TL;DR
This work introduces a diffusion-model-based framework for interactive laparoscopic video generation, conditioned on text prompts and on tool positions given as segmentation masks. A three-stage pipeline (StableDiffusion finetuning, ControlNet conditioning on tool masks, and ControlVideo cross-frame inference) delivers high realism and controllable tool placement on the CholecT50/Cholec80 datasets, reporting an FID of 38.097 and a pixel-wise F1-score of 0.71. The approach is evaluated with multiple fidelity and factual-correctness metrics, including RDV action recognition, to demonstrate improved realism and action coherence, with qualitative evidence of coherent surgical sequences. Limitations include a remaining gap to real videos and room for improvement in 3D depth and camera motion, which would further enhance training effectiveness for surgical education.
Abstract
Generative AI in general, and synthetic visual data generation in particular, hold much promise for benefiting surgical training by providing photorealism to simulation environments. Current training methods primarily rely on reading materials and observing live surgeries, which can be time-consuming and impractical. In this work, we take a significant step towards improving the training process. Specifically, we use diffusion models in combination with a zero-shot video diffusion method to interactively generate realistic laparoscopic images and videos by specifying a surgical action through text and guiding the generation with tool positions through segmentation masks. We demonstrate the performance of our approach using the publicly available Cholec dataset family. We evaluate the fidelity and factual correctness of our generated images using a surgical action recognition model, and the spatial control of tool generation using the pixel-wise F1-score. We achieve an FID of 38.097 and an F1-score of 0.71.
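
As an illustration of the pixel-wise F1-score used to assess spatial control, the following minimal sketch compares a binary tool mask against a reference mask. It assumes both masks are binary arrays of the same shape; the function name `pixelwise_f1` and the convention of returning 1.0 when both masks are empty are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def pixelwise_f1(pred_mask, true_mask):
    """Pixel-wise F1 between two binary masks of identical shape (illustrative helper)."""
    pred = np.asarray(pred_mask).astype(bool)
    true = np.asarray(true_mask).astype(bool)
    if pred.shape != true.shape:
        raise ValueError("masks must have the same shape")

    tp = np.logical_and(pred, true).sum()    # tool pixels present in both masks
    fp = np.logical_and(pred, ~true).sum()   # tool pixels only in the prediction
    fn = np.logical_and(~pred, true).sum()   # tool pixels only in the reference

    if tp + fp + fn == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return float(2 * tp / (2 * tp + fp + fn))

# Toy example: one overlapping pixel, one false positive, one false negative.
pred = [[1, 1, 0],
        [0, 0, 0]]
true = [[1, 0, 0],
        [1, 0, 0]]
score = pixelwise_f1(pred, true)  # 2*1 / (2*1 + 1 + 1) = 0.5
```

In practice the predicted mask would have to be obtained by segmenting the tools in a generated frame; how that is done is outside the scope of this sketch.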
