Optical Flow Representation Alignment Mamba Diffusion Model for Medical Video Generation

Zhenbin Wang, Lei Zhang, Lituan Wang, Minjuan Zhu, Zhenwei Zhang

TL;DR

This work proposes Medical Simulation Video Generator (MedSora), which combines three key elements: a video diffusion framework that integrates the advantages of attention and Mamba to balance low computational load with high-quality video generation, an optical flow representation alignment method, and a video VAE with frequency compensation.

Abstract

Medical video generation models are expected to have a profound impact on the healthcare industry, including but not limited to medical education and training, surgical planning, and simulation. Current video diffusion models typically build on image diffusion architectures by adding temporal operations (such as 3D convolution and temporal attention). Although this approach is effective, its simplicity limits spatio-temporal performance, and it consumes substantial computational resources. To counter this, we propose Medical Simulation Video Generator (MedSora), which incorporates three key elements: i) a video diffusion framework that integrates the advantages of attention and Mamba, balancing low computational load with high-quality video generation, ii) an optical flow representation alignment method that implicitly enhances attention to inter-frame pixels, and iii) a video variational autoencoder (VAE) with frequency compensation that addresses the loss of medical feature information that occurs when pixel space is transformed into latent features and then back into pixel frames. Extensive experiments and applications demonstrate that MedSora exhibits superior visual quality in generating medical videos, outperforming the most advanced baseline methods. Further results and code are available at https://wongzbb.github.io/MedSora

Paper Structure

This paper contains 19 sections, 17 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Illustration of spatial attention, pseudo-3D or 2D+1D convolution, spatio-temporal attention, and our spatio-temporal Mamba. $\mathbf{x}^k$ represents the feature map of the $k$-th video frame. Patches marked with '*' are computed together with the colored patches to aggregate their features.
  • Figure 2: MedSora is built upon the video Mamba diffusion model, the optical flow representation alignment component, and the frequency compensation video VAE. The video Mamba diffusion model enables video generation under low resource load; the optical flow representation alignment aims to produce smoother videos and accelerate model convergence; the frequency compensation video VAE addresses potential inconsistencies during the video reconstruction process.
  • Figure 3: We present the structural details of spatial Mamba and temporal Mamba, both of which adopt the scalar state-space model. Spatial Mamba employs a bidirectional spiral scanning scheme to emphasize spatial continuity, while temporal Mamba scans along the frame axis.
  • Figure 4: The proposed frequency compensation video VAE. Current video diffusion models either rely on image VAEs for encoding and reconstructing videos, potentially omitting crucial temporal information and resulting in inadequate compression rates, or rely on video VAEs that lack training on medical videos. The proposed frequency compensation video VAE introduces frequency compensation components with two distinct structures, built on the 3D causal VAE, ensuring temporal consistency while focusing on the structural information of medical videos, which is instrumental in identifying areas such as lesions and textures.
  • Figure 5: Qualitative comparison between MedSora and other video generation models. The generated videos consist of $16$ frames, from which $10$ consecutive frames starting at $t=0$ are selected for demonstration. For a more comprehensive comparison of qualitative results, please refer to the generated videos on our project webpage.
  • ...and 1 more figure
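
The Figure 3 caption mentions a bidirectional spiral scanning scheme for spatial Mamba. The following is a minimal sketch of what such a scan order could look like on a grid of patches; it is not taken from the MedSora code. The function names `spiral_order` and `bidirectional_spiral_indices`, the outside-in clockwise direction, and the top-left starting corner are all assumptions made for illustration, and the paper's actual scheme may differ.

```python
def spiral_order(height, width):
    """Return (row, col) pairs visiting a height x width grid in a
    clockwise, outside-in spiral starting at the top-left corner.
    (Direction and starting corner are illustrative assumptions.)"""
    top, bottom, left, right = 0, height - 1, 0, width - 1
    order = []
    while top <= bottom and left <= right:
        # Top edge, left to right.
        for c in range(left, right + 1):
            order.append((top, c))
        # Right edge, top to bottom (skipping the corner already visited).
        for r in range(top + 1, bottom + 1):
            order.append((r, right))
        # Bottom edge, right to left, only if it is a different row.
        if top < bottom:
            for c in range(right - 1, left - 1, -1):
                order.append((bottom, c))
        # Left edge, bottom to top, only if it is a different column.
        if left < right:
            for r in range(bottom - 1, top, -1):
                order.append((r, left))
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order


def bidirectional_spiral_indices(height, width):
    """Return flat patch indices for a forward spiral scan and its reverse.
    A hypothetical helper: a sequence model would consume the patches once
    in each order and merge the two outputs."""
    forward = [r * width + c for r, c in spiral_order(height, width)]
    backward = forward[::-1]
    return forward, backward


if __name__ == "__main__":
    fwd, bwd = bidirectional_spiral_indices(3, 3)
    print(fwd)  # [0, 1, 2, 5, 8, 7, 6, 3, 4]
    print(bwd)  # [4, 3, 6, 7, 8, 5, 2, 1, 0]
```

In this ordering every pair of consecutive patches is adjacent on the grid, which is one way to read the caption's emphasis on spatial continuity; a plain row-by-row raster scan, by contrast, jumps across the image at the end of each row.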