DCAU-Net: Differential Cross Attention and Channel-Spatial Feature Fusion for Medical Image Segmentation

Yanxin Li, Hui Wan, Libin Lan

TL;DR

DCAU-Net is a novel yet efficient segmentation framework built on two key ideas: a Differential Cross Attention that computes the difference between two independent softmax attention maps to adaptively highlight discriminative structures, and a Channel-Spatial Feature Fusion strategy that adaptively recalibrates features from skip connections and up-sampling paths using sequential channel and spatial attention.

Abstract

Accurate medical image segmentation requires effective modeling of both long-range dependencies and fine-grained boundary details. While transformers mitigate the issue of insufficient semantic information arising from the limited receptive field inherent in convolutional neural networks, they introduce new challenges: standard self-attention incurs quadratic computational complexity and often assigns non-negligible attention weights to irrelevant regions, diluting focus on discriminative structures and ultimately compromising segmentation accuracy. Existing attention variants, although effective in reducing computational complexity, fail to suppress redundant computation and inadvertently impair global context modeling. Furthermore, conventional fusion strategies in encoder-decoder architectures, typically based on simple concatenation or summation, cannot adaptively integrate high-level semantic information with low-level spatial details. To address these limitations, we propose DCAU-Net, a novel yet efficient segmentation framework with two key ideas. First, a new Differential Cross Attention (DCA) is designed to compute the difference between two independent softmax attention maps to adaptively highlight discriminative structures. By replacing pixel-wise key and value tokens with window-level summary tokens, DCA dramatically reduces computational complexity without sacrificing precision. Second, a Channel-Spatial Feature Fusion (CSFF) strategy is introduced to adaptively recalibrate features from skip connections and up-sampling paths using sequential channel and spatial attention, effectively suppressing redundant information and amplifying salient cues. Experiments on two public benchmarks demonstrate that DCAU-Net achieves competitive performance with enhanced segmentation accuracy and robustness.
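
The DCA idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how window-level summary tokens are produced or how the two attention maps are weighted, so average pooling over non-overlapping windows, a single attention head, and the scalar weight `lam` are all assumptions made here, and the names `window_summary`, `differential_cross_attention`, and the `params` keys are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_summary(feat, window):
    """Pool an (H, W, C) map into one summary token per window.

    Assumption: summaries are plain averages over non-overlapping windows.
    Returns an array of shape (H//window * W//window, C).
    """
    H, W, C = feat.shape
    assert H % window == 0 and W % window == 0
    blocks = feat.reshape(H // window, window, W // window, window, C)
    return blocks.mean(axis=(1, 3)).reshape(-1, C)


def differential_cross_attention(feat, params, window=4, lam=0.5):
    """Single-head sketch: pixel-wise queries attend to window-level keys/values.

    `params` holds hypothetical projection matrices wq1, wq2, wk1, wk2, wv.
    `lam` (assumed) scales the second attention map before subtraction.
    """
    H, W, C = feat.shape
    pixels = feat.reshape(H * W, C)           # one query token per pixel
    summary = window_summary(feat, window)    # one key/value token per window
    d = params["wq1"].shape[1]

    q1, q2 = pixels @ params["wq1"], pixels @ params["wq2"]
    k1, k2 = summary @ params["wk1"], summary @ params["wk2"]
    v = summary @ params["wv"]

    # Two independent softmax attention maps, each of shape (H*W, num_windows).
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))

    diff = a1 - lam * a2                      # differential attention map
    out = (diff @ v).reshape(H, W, -1)
    return out, diff
```

In this sketch, an 8×8 feature map with `window=4` yields 64 queries but only 4 key/value tokens, so each attention map has shape 64×4 instead of the 64×64 that pixel-wise self-attention would need. Because each softmax row sums to 1, every row of the differential map sums to `1 - lam`, and entries can be negative where the second map outweighs the first.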

Paper Structure

This paper contains 16 sections, 11 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Details of the differential cross attention. It performs efficient cross attention between pixel-wise queries and window-level key–value pairs via differential attention, suppressing redundancy and enhancing focus on discriminative structures.
  • Figure 2: Details of the DCA Block, consisting of a 3×3 depth-wise convolution, a DCA module, and a 2-layer MLP.
  • Figure 3: Overall architecture of the proposed DCAU-Net. The network adopts a U-shaped encoder-decoder framework with four hierarchical stages. The encoder integrates DCA blocks, centered on the differential cross attention. Features from the encoder are transferred to the decoder via skip connections and adaptively fused with those from previous decoder layers through CSFF blocks to enhance segmentation accuracy.
  • Figure 4: Qualitative comparisons of our approach against other state-of-the-art methods on the Synapse dataset.
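
The CSFF fusion mentioned in the Figure 3 caption, which recalibrates skip-connection and up-sampled decoder features with sequential channel and spatial attention, might look roughly like the NumPy sketch below. The source text does not specify the internals, so everything here is assumed: the two inputs are concatenated, the channel gate follows a squeeze-and-excitation pattern, and the spatial gate mixes channel-wise mean and max maps with a two-element weight. The names `channel_gate`, `spatial_gate`, `csff_sketch`, `w1`, `w2`, and `w_sp` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_gate(x, w1, w2):
    """Assumed SE-style gate: global average pool -> 2-layer MLP -> sigmoid."""
    pooled = x.mean(axis=(0, 1))              # (C,) one statistic per channel
    hidden = np.maximum(pooled @ w1, 0.0)     # ReLU bottleneck
    return sigmoid(hidden @ w2)               # (C,) weights in (0, 1)


def spatial_gate(x, w_sp):
    """Assumed per-pixel gate from channel-wise mean and max maps."""
    stats = np.stack([x.mean(axis=-1), x.max(axis=-1)], axis=-1)  # (H, W, 2)
    return sigmoid(stats @ w_sp)              # (H, W) weights in (0, 1)


def csff_sketch(skip, up, w1, w2, w_sp):
    """Fuse skip and up-sampled features: channel attention, then spatial."""
    fused = np.concatenate([skip, up], axis=-1)            # (H, W, 2C)
    fused = fused * channel_gate(fused, w1, w2)            # reweight channels
    fused = fused * spatial_gate(fused, w_sp)[..., None]   # reweight positions
    return fused
```

Both gates produce values strictly between 0 and 1, so in this sketch the fusion can only attenuate features relative to plain concatenation; a real block would typically follow it with a learned projection back to the decoder's channel width.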