
Soft Masked Mamba Diffusion Model for CT to MRI Conversion

Zhenbin Wang, Lei Zhang, Lituan Wang, Zhenwei Zhang

TL;DR

This work tackles CT-to-MRI conversion by proposing Diffusion Mamba (DiffMa), a latent diffusion model built on a Mamba State-Space backbone that operates on 2D latent patches. It introduces Spiral-Scan to preserve 2D spatial continuity and a soft-mask Cross-Sequence Attention mechanism via a Vision Embedder to leverage CT priors and emphasize diagnostically relevant regions. Empirical results on SynthRAD2023 pelvis and brain data show DiffMa achieving superior SSIM and PSNR with efficient, linear-complexity computation compared to CNN/ViT baselines and other Mamba variants, underscoring its potential for cost-effective medical imaging. The approach combines a CT-conditioned diffusion framework with cross-sequence supervision and latent-space processing, enabling accurate MR generation while maintaining computational efficiency and scalability for clinical deployment.

Abstract

Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) are the predominant modalities in medical imaging. Although MRI captures anatomical structures in greater detail than CT, it entails higher financial costs and longer image acquisition times. In this study, we aim to train a latent diffusion model for CT to MRI conversion, replacing the commonly used U-Net or Transformer backbone with a State-Space Model (SSM) called Mamba that operates on latent patches. First, we identify critical oversights in the scan scheme of most Mamba-based vision methods, including inadequate attention to the spatial continuity of patch tokens and a lack of consideration for their varying importance to the target task. Second, building on this insight, we introduce Diffusion Mamba (DiffMa), which employs a soft mask to integrate Cross-Sequence Attention into Mamba and conducts the selective scan in a spiral manner. Finally, extensive experiments demonstrate strong performance by DiffMa in medical image generation tasks, with notable advantages in input scaling efficiency over existing benchmark models. The code and models are available at https://github.com/wongzbb/DiffMa-Diffusion-Mamba

Paper Structure

This paper contains 16 sections, 10 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Visualization of the significance of each 14x14 patch from two latent pelvic images. The size and darkness of the circles denote the level of importance, with larger and darker circles indicating greater significance. The weights are derived from the pre-trained Vision Embedder.
  • Figure 2: The Diffusion Mamba (DiffMa) framework. Left: The overall diffusion framework. We use long skip connections to prevent NaN values. Middle: Details of the Mamba blocks, which consist of two branches that use adaptive layer norms (AdaLN) to incorporate conditioning; a soft mask provides prior knowledge for the sequences. Right: Details of Mamba, where we employ Spiral-Scan to focus on structural information.
  • Figure 3: Framework of the Vision Embedder. Unlike MRI patch tokens, CT patch tokens are trained without positional or temporal embeddings.
  • Figure 4: The 2D image Spiral-Scan. There are eight schemes in total, each containing two modes, and every block employs one of these schemes.
  • Figure 5: Visualizations of 14 brain CT to MRI conversion pairs from the SynthRAD2023 dataset. In each pair, the left image is the input CT, the middle is the generated MRI, and the right is the ground-truth MRI. Zoom in for a better view.
  • ...and 2 more figures
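
To make the Spiral-Scan idea from Figure 4 more concrete, here is a minimal Python sketch of one possible spiral ordering over a grid of patch tokens. It is an illustration under stated assumptions, not the authors' implementation: the function names `spiral_order` and `spiral_indices` are hypothetical, the clockwise inward direction starting at the top-left corner is just one choice, and the `reverse` flag is only a guess at what a second "mode" of a scheme might look like. The point it shows is that, unlike a row-by-row raster scan, consecutive tokens in a spiral sequence are always spatial neighbours in the 2D grid.

```python
def spiral_order(height, width):
    """Return (row, col) pairs that visit a height x width grid in a
    clockwise inward spiral starting from the top-left corner.

    Hypothetical helper for illustration; the paper's eight schemes are
    not reproduced here.
    """
    top, bottom, left, right = 0, height - 1, 0, width - 1
    order = []
    while top <= bottom and left <= right:
        # Walk the top edge from left to right.
        for c in range(left, right + 1):
            order.append((top, c))
        # Walk the right edge downwards.
        for r in range(top + 1, bottom + 1):
            order.append((r, right))
        # Walk the bottom edge from right to left (if it is a distinct row).
        if top < bottom:
            for c in range(right - 1, left - 1, -1):
                order.append((bottom, c))
        # Walk the left edge upwards (if it is a distinct column).
        if left < right:
            for r in range(bottom - 1, top, -1):
                order.append((r, left))
        # Move to the next inner ring.
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order


def spiral_indices(height, width, reverse=False):
    """Map the spiral path to flat (row-major) token indices.

    reverse=True walks the same path outwards from the centre, which is an
    assumed stand-in for a second scan mode.
    """
    idx = [r * width + c for r, c in spiral_order(height, width)]
    return idx[::-1] if reverse else idx


if __name__ == "__main__":
    # A 14 x 14 latent patch grid, matching the grid size shown in Figure 1.
    perm = spiral_indices(14, 14)
    print(len(perm), perm[:5])
```

A sequence of patch tokens stored in row-major order could then be reordered with this permutation before the selective scan and restored afterwards with its inverse. How DiffMa assigns its schemes to blocks is described only at the level of the Figure 4 caption here, so that part is left out of the sketch.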