Endora: Video Generation Models as Endoscopy Simulators

Chenxin Li; Hengyu Liu; Yifan Liu; Brandon Y. Feng; Wuyang Li; Xinyu Liu; Zhen Chen; Jing Shao; Yixuan Yuan

TL;DR

Endora demonstrates exceptional visual quality in generating endoscopy videos, surpassing state-of-the-art methods in extensive testing, setting a substantial stage for further advances in medical content generation.

Abstract

Generative models hold promise for revolutionizing medical education, robot-assisted surgery, and data augmentation for machine learning. Despite progress in generating 2D medical images, the complex domain of clinical video generation has largely remained untapped. This paper introduces Endora, an innovative approach to generate medical videos that simulate clinical endoscopy scenes. We present a novel generative model design that integrates a meticulously crafted spatial-temporal video transformer with advanced 2D vision foundation model priors, explicitly modeling spatial-temporal dynamics during video generation. We also pioneer the first public benchmark for endoscopy simulation with video generation models, adapting existing state-of-the-art methods for this endeavor. Endora demonstrates exceptional visual quality in generating endoscopy videos, surpassing state-of-the-art methods in extensive testing. Moreover, we explore how this endoscopy simulator can empower downstream video analysis tasks and even generate 3D medical scenes with multi-view consistency. In a nutshell, Endora marks a notable breakthrough in the deployment of generative AI for clinical endoscopy research, setting a substantial stage for further advances in medical content generation. For more details, please visit our project page: https://endora-medvidgen.github.io/.

Paper Structure (13 sections, 2 equations, 3 figures, 3 tables)

Introduction
Method
Diffusion Model for Video Generation
Spatial-temporal Transformer
Prior-guided Feature Facilitation
Experiments
Experiment Settings
Comparison with State-of-the-arts
Further Empirical Studies
Conclusion
Acknowledgement
Dataset Setting
Hyperparameters

Figures (3)

Figure 1: Endora Training Overview. Starting from the noised input video sequences, the diffusion model iteratively removes the noise and recovers the clean sequence. The long-range spatial-temporal dynamics are modeled by an interlaced cascade of several spatial-temporal Transformer blocks. We further instill the prior from a 2D vision foundation model (DINO caron2021emerging) to guide feature extraction.
Figure 2: Qualitative Comparison on Kvasir-Capsule borgli2020hyperkvasir and Cholec nwoye2022rendezvous Datasets.
Figure 3: RGB and 3D Depth Reconstructed from Generated Videos.
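
The Figure 1 caption mentions an interlaced cascade of spatial-temporal Transformer blocks. One plausible reading, assumed here and not confirmed by this page, is that spatial blocks let tokens attend within a single frame while temporal blocks let the same patch position attend across frames, with the two block types alternating. The sketch below only illustrates that token grouping in plain Python; the function names, the spatial-first ordering, and the (frame, patch) token layout are hypothetical and are not taken from the Endora implementation.

```python
from typing import List, Tuple

# A token is identified by (frame index, patch index). Hypothetical layout.
Token = Tuple[int, int]


def spatial_groups(num_frames: int, num_patches: int) -> List[List[Token]]:
    """Attention groups for a spatial block: all patches of one frame."""
    return [[(t, p) for p in range(num_patches)] for t in range(num_frames)]


def temporal_groups(num_frames: int, num_patches: int) -> List[List[Token]]:
    """Attention groups for a temporal block: one patch position over all frames."""
    return [[(t, p) for t in range(num_frames)] for p in range(num_patches)]


def interlaced_schedule(num_blocks: int) -> List[str]:
    """Alternate block types, starting with spatial (an assumed ordering)."""
    return ["spatial" if i % 2 == 0 else "temporal" for i in range(num_blocks)]


if __name__ == "__main__":
    # Tiny example: 2 frames, 3 patches per frame, 4 blocks.
    for kind in interlaced_schedule(4):
        groups = spatial_groups(2, 3) if kind == "spatial" else temporal_groups(2, 3)
        print(kind, groups)
```

Under this reading, each block attends over small groups (one frame, or one patch track) rather than over all frame-patch pairs at once, and alternating the two lets information propagate across both space and time.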
