I2I-Mamba: Multi-modal medical image synthesis via selective state space modeling

Omer F. Atli, Bilal Kabas, Fuat Arslan, Arda C. Demirtas, Mahmut Yurt, Onat Dalmaz, Tolga Çukur

TL;DR

I2I-Mamba introduces a dual-domain state-space model for cross-modal medical image synthesis, combining image- and Fourier-domain SSM branches with spiral-scan tokenization and channel mixing to capture both short- and long-range context while preserving spatial detail. The architecture, featuring a high-resolution bottleneck with ddMamba blocks and residual CNNs, outperforms CNN, transformer, and prior SSM baselines across multi-contrast MRI and MRI-CT translation tasks, demonstrated on IXI, BraTS, and MRI-CT datasets. Ablation studies validate the contribution of each component, including the spiral-scan SSM and dual-domain processing, to improvements in PSNR and SSIM. The method offers a scalable, efficient solution for missing-modality imputation with potential clinical impact in reducing scan times, enabling safer imaging, and harmonizing large-scale datasets.

Abstract

Multi-modal medical image synthesis involves nonlinear transformation of tissue signals between source and target modalities, where tissues exhibit contextual interactions across diverse spatial distances. As such, the utility of a network architecture in synthesis depends on its ability to express the broad set of contextual features in medical images. Convolutional neural networks (CNNs) offer high local precision at the expense of poor sensitivity to long-range context. While transformers promise to alleviate this issue, they suffer from an unfavorable trade-off between sensitivity to long- versus short-range context due to the intrinsic complexity of attention filters. To effectively capture contextual features while avoiding the complexity-driven trade-offs, here we introduce a novel multi-modal synthesis method, I2I-Mamba, based on the state space modeling (SSM) framework. Focusing on high-level representations across a hybrid residual architecture, I2I-Mamba leverages novel dual-domain Mamba (ddMamba) blocks for complementary contextual modeling in image and Fourier domains, while maintaining spatial precision with convolutional layers. Diverting from conventional raster-scan trajectories, ddMamba leverages novel SSM operators based on a spiral-scan trajectory to learn context with enhanced angular isotropy and radial coverage, and a channel-mixing layer to aggregate context across the channel dimension. Comprehensive demonstrations on multi-contrast MRI and MRI-CT protocols indicate that I2I-Mamba outperforms state-of-the-art CNNs, transformers and SSMs.
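
To make the spiral-scan idea concrete, the following is a minimal sketch of how a 2D feature map could be flattened into a token sequence along a spiral rather than a raster trajectory. It assumes a square-shaped, clockwise spiral over the pixel grid, and the function names (`spiral_order`, `to_sequence`, `from_sequence`) are hypothetical; the exact trajectory and implementation used by the authors may differ.

```python
import numpy as np


def spiral_order(height, width):
    """Flat indices of an H x W grid visited along an outside-in, clockwise spiral.

    Illustrative assumption only: reversing the result gives a center-out scan.
    """
    top, bottom, left, right = 0, height - 1, 0, width - 1
    order = []
    while top <= bottom and left <= right:
        # Walk the top edge left to right.
        for c in range(left, right + 1):
            order.append(top * width + c)
        # Walk the right edge top to bottom.
        for r in range(top + 1, bottom + 1):
            order.append(r * width + right)
        # Walk the bottom edge right to left (only if it is a distinct row).
        if top < bottom:
            for c in range(right - 1, left - 1, -1):
                order.append(bottom * width + c)
        # Walk the left edge bottom to top (only if it is a distinct column).
        if left < right:
            for r in range(bottom - 1, top, -1):
                order.append(r * width + left)
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return np.array(order)


def to_sequence(feature_map, order):
    """Flatten a (C, H, W) feature map into (H*W, C) tokens in scan order."""
    channels, height, width = feature_map.shape
    return feature_map.reshape(channels, height * width)[:, order].T


def from_sequence(tokens, order, height, width):
    """Inverse of to_sequence: scatter (H*W, C) tokens back onto the grid."""
    channels = tokens.shape[1]
    flat = np.empty((channels, height * width), dtype=tokens.dtype)
    flat[:, order] = tokens.T
    return flat.reshape(channels, height, width)


if __name__ == "__main__":
    outside_in = spiral_order(3, 3)
    center_out = outside_in[::-1]  # start at the central pixel, spiral outward
    features = np.arange(2 * 3 * 3, dtype=np.float32).reshape(2, 3, 3)
    tokens = to_sequence(features, center_out)  # sequence a 1D SSM could consume
    restored = from_sequence(tokens, center_out, 3, 3)
    print(center_out, np.array_equal(features, restored))
```

In a scheme like this, pixels at a similar radius from the center sit close together in the sequence regardless of orientation, which is the intuition behind the more isotropic footprint compared with row-wise or column-wise raster scans.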

Paper Structure (20 sections, 15 equations, 6 figures, 7 tables)

This paper contains 20 sections, 15 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Network architecture for I2I-Mamba. The proposed model comprises encoder, bottleneck, and decoder modules to synthesize target from source images. The encoder extracts high-level representations of the source image via convolutional layers. The bottleneck extracts task-relevant contextual information across spatial, frequency and channel dimensions with the aid of dual-domain Mamba (ddMamba) blocks (ddMamba$^\text{I}$: image domain, ddMamba$^\text{F}$: Fourier domain) comprising channel-mixing layers, and maintains high spatial precision with the aid of residual CNN blocks. The decoder back-projects the contextualized representations onto the target image via convolutional layers.
  • Figure 2: Footprints illustrating the spatial distribution of focus that each learning operator deploys (see colorbar), while seeking contextual interactions of a central pixel (orange dots). (a) Convolution operators in CNNs have localized footprints with heavy focus over a restricted neighborhood, compromising sensitivity to long-range contextual interactions. (b) Attention operators in transformers have non-local footprints that diffusely distribute focus over the image, compromising local precision. (c), (d) Conventional state-space operators in SSMs are based on multiple raster-scan trajectories with anisotropic footprints biased towards rectangular image axes, limiting sensitivity to interactions in non-axial orientations. (e) I2I-Mamba's state-space operator leverages a novel spiral-scan trajectory that attains a near-isotropic footprint with more uniform focus across orientations, maintaining an improved balance between long- versus short-range contextual interactions.
  • Figure 3: Representative results for T1, PD $\rightarrow$ T2 in IXI. Synthetic target images from competing methods are displayed along with source images and reference target images. Zoom-in windows and performance metrics are also included to highlight differences among methods.
  • Figure 4: Representative results for FLAIR $\rightarrow$ T2 in BraTS. Synthetic target images from competing methods are displayed along with source images and reference target images. Zoom-in windows and performance metrics are also included to highlight differences.
  • Figure 5: Representative results for T2 $\rightarrow$ CT in the MRI-CT dataset. Synthetic target images from competing methods are displayed along with source images and reference target images. Zoom-in windows and performance metrics are included to highlight differences.
  • ...and 1 more figures