DenoMamba: A fused state-space model for low-dose CT denoising

Şaban Öztürk, Oğuz Can Duran, Tolga Çukur

TL;DR

This work tackles low-dose CT denoising by introducing DenoMamba, a fused state-space model that jointly captures spatial and channel context through novel FuseSSM blocks within an hourglass encoder–decoder. By integrating a spatial SSM with a channel SSM augmented by a gated convolution, along with an identity path and a convolutional fusion module, the method preserves fine spatial details while leveraging long-range dependencies. Across 25% and 10% dose LDCT datasets, DenoMamba outperforms CNN, GAN, diffusion, and transformer-based baselines in PSNR, SSIM, and RMSE, and demonstrates robust generalization to cross-domain and dose-shift scenarios. The results highlight the practical potential of purely SSM-based denoising for high-fidelity LDCT restoration, with ablations confirming the necessity of each architectural component and fusion strategy.
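As a rough illustration of the idea of scanning a feature map along both its spatial and channel axes and then fusing the results with an identity path, the sketch below uses a toy, fixed-parameter scalar recurrence. It is not the authors' implementation: the function names, the parameters `a`, `b`, `c`, and the weighted-sum fusion are all assumptions made for illustration, and the learned SSM layers, gated convolution network, and convolutional fusion module of the actual model are not reproduced.

```python
def ssm_scan(sequence, a=0.5, b=1.0, c=1.0):
    """Toy time-invariant scalar SSM: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    a, b, c are hypothetical fixed parameters, not learned ones.
    """
    h = 0.0
    out = []
    for x in sequence:
        h = a * h + b * x
        out.append(c * h)
    return out


def spatial_scan(features):
    """Scan along the token (spatial) axis, independently for each channel."""
    return [ssm_scan(channel) for channel in features]


def channel_scan(features):
    """Scan along the channel axis: transpose, scan each token, transpose back."""
    transposed = [list(col) for col in zip(*features)]
    scanned = [ssm_scan(token) for token in transposed]
    return [list(row) for row in zip(*scanned)]


def fuse(identity, spatial, channel, weights=(1.0, 1.0, 1.0)):
    """Stand-in for a fusion module: a fixed weighted sum of the three paths."""
    wi, ws, wc = weights
    return [
        [wi * i + ws * s + wc * ch for i, s, ch in zip(ri, rs, rc)]
        for ri, rs, rc in zip(identity, spatial, channel)
    ]


# features[channel][token]: 2 channels, 3 flattened spatial positions
features = [[1.0, 0.0, 0.0],
            [0.0, 2.0, 0.0]]
fused = fuse(features, spatial_scan(features), channel_scan(features))
```

In this toy setting, the spatial scan spreads each value forward along the token axis within a channel, while the channel scan spreads it across channels at a fixed token position. The identity path keeps the original values in the fused output.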

Abstract

Low-dose computed tomography (LDCT) lowers potential risks linked to radiation exposure while relying on advanced denoising algorithms to maintain diagnostic quality in reconstructed images. The reigning paradigm in LDCT denoising is based on neural network models that learn data-driven image priors to separate noise evoked by dose reduction from underlying tissue signals. Naturally, the fidelity of these priors depends on the model's ability to capture the broad range of contextual features evident in CT images. Earlier convolutional neural networks (CNNs) are highly adept at efficiently capturing short-range spatial context, but their limited receptive fields reduce sensitivity to interactions over longer distances. Although transformers based on self-attention mechanisms have recently been proposed to increase sensitivity to long-range context, they can suffer from suboptimal performance and efficiency due to elevated model complexity, particularly for high-resolution CT images. For high-quality restoration of LDCT images, here we introduce DenoMamba, a novel denoising method based on state-space modeling (SSM) that efficiently captures short- and long-range context in medical images. Following an hourglass architecture with encoder-decoder stages, DenoMamba employs a spatial SSM module to encode spatial context and a novel channel SSM module equipped with a secondary gated convolution network to encode latent features of channel context at each stage. Feature maps from the two modules are then consolidated with low-level input features via a convolution fusion module (CFM). Comprehensive experiments on LDCT datasets at 25% and 10% dose demonstrate that DenoMamba outperforms state-of-the-art denoisers with average improvements of 1.4 dB PSNR, 1.1% SSIM, and 1.6% RMSE in recovered image quality.
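For readers unfamiliar with two of the reported metrics, the following is a minimal sketch of how RMSE and PSNR are commonly computed between a reference image and a denoised estimate. The function names and the `data_range` parameter are assumptions for illustration. The intensity normalization and evaluation protocol used in the paper are not stated in this summary, so absolute values from this sketch should not be compared directly with the reported numbers. SSIM is omitted because it requires a windowed computation better left to an established library.

```python
import math


def rmse(reference, estimate):
    """Root-mean-square error between two equally sized 2D images (lists of rows)."""
    ref = [v for row in reference for v in row]
    est = [v for row in estimate for v in row]
    if len(ref) != len(est):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref)
    return math.sqrt(mse)


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB; data_range is the assumed intensity span."""
    err = rmse(reference, estimate)
    if err == 0.0:
        return math.inf  # identical images
    return 20.0 * math.log10(data_range / err)


# Hypothetical 2x2 images with intensities normalized to [0, 1]
ndct = [[0.0, 0.5], [1.0, 0.25]]
denoised = [[0.1, 0.6], [0.9, 0.35]]
error = rmse(ndct, denoised)
quality = psnr(ndct, denoised, data_range=1.0)
```

With a uniform per-pixel error of 0.1 on a unit intensity range, the RMSE is about 0.1 and the PSNR is about 20 dB. Lower RMSE and higher PSNR both indicate an estimate closer to the reference.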

Paper Structure (21 sections, 15 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Overall architecture of DenoMamba. The proposed model comprises encoder-decoder stages that are residually connected with long skip connections. In the encoder stages, input feature maps are projected through cascaded FuseSSM blocks, and spatially downsampled while the channel dimensionality is increased. In the decoder stages, input feature maps are back-projected through cascaded FuseSSM blocks, and spatially upsampled while the channel dimensionality is reduced. The proposed FuseSSM blocks use a spatial SSM module to extract spatial context, a novel channel SSM module to extract channel context, and an identity path to propagate low-level spatial features. Afterwards, low-level spatial features and their spatial- and channel-wise contextualized representations are aggregated across a convolutional fusion module (CFM).
  • Figure 2: Inner modules of the FuseSSM blocks. Each FuseSSM block comprises a channel SSM module, a spatial SSM module, an identity propagation path, and a CFM module. The channel SSM module performs convolutional encoding of image tokens after layer normalization, and processes the transposed feature map via an SSM layer to capture an initial set of contextual features across the channel dimension. To further extract higher-order latent features, this initial set is projected through a gated convolutional network, and the two sets of contextual features are residually combined. The spatial SSM module performs convolutional encoding of image tokens after layer normalization, and processes the feature map via an SSM layer to capture contextual features across the spatial dimension. The CFM module pools low-level features propagated by the identity path with contextual features from the channel and spatial SSM modules, and nonlinearly fuses them via convolutional layers.
  • Figure 3: Denoising results from the 25%-dose AAPM dataset are depicted for representative cross-sections. Images recovered by competing methods are shown along with the LDCT image (i.e., model input), and the NDCT image (i.e., ground truth). Zoom-in displays and arrows are used to showcase regions with visible differences in image quality among competing methods. Display windows of [-150 350] HU are used.
  • Figure 4: Denoising results from the 10%-dose AAPM dataset are depicted for representative cross-sections. Display windows of [-350 350] HU are used.
  • Figure 5: Denoising results for representative cross-sections from the experiments conducted to assess model generalization. a) Models trained on the 25%-dose AAPM dataset were evaluated on the 10%-dose Piglet CT dataset. b) For the AAPM dataset, models trained on 25%-dose scans were evaluated on 10%-dose scans. Display windows of a) [-400 1000] HU and b) [-250 450] HU are used.
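The display windows quoted in the figure captions, such as [-150 350] HU, refer to the range of Hounsfield units mapped to the visible grayscale. A minimal sketch of that mapping is shown below, assuming a simple clip followed by a linear rescale to [0, 1]. The function name is hypothetical, and the exact rendering pipeline used to produce the figures is not described in this summary.

```python
def apply_display_window(hu_values, low, high):
    """Clip HU values to [low, high] and rescale linearly to [0, 1]."""
    if high <= low:
        raise ValueError("window upper bound must exceed lower bound")
    width = high - low
    return [(min(max(v, low), high) - low) / width for v in hu_values]


# Hypothetical row of HU values: air, window floor, soft tissue, window ceiling, dense bone
row = [-1000.0, -150.0, 100.0, 350.0, 1200.0]
windowed = apply_display_window(row, -150.0, 350.0)
```

Values at or below the lower bound render as black and values at or above the upper bound as white. A narrower window therefore devotes more grayscale contrast to soft-tissue differences, at the cost of saturating air and bone.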