TransDAE: Dual Attention Mechanism in a Hierarchical Transformer for Efficient Medical Image Segmentation

Bobby Azad, Pourya Adibfar, Kaiqun Fu

TL;DR

Medical image segmentation requires accurate delineation of multi-scale anatomical structures, a task where CNNs excel at local detail but struggle with global context, while pure Transformers capture long-range relations yet can miss fine-grained localization. TransDAE addresses this gap with a hierarchical U-Net–like Transformer that incorporates a Dual Attention Transformer Block for simultaneous spatial and channel modeling and an Inter-Scale Interaction Module to enhance skip-connection fusion; efficient self-attention and spatial-reduction strategies control computational burden. Key contributions include (1) a dual spatial-channel attention mechanism, (2) an Efficient Attention-based channel pathway, (3) the Inter-Scale Interaction Module that decomposes large-kernel operations to fuse encoder–decoder features, and (4) strong state-of-the-art performance on the Synapse multi-organ dataset without pretraining (DSC = 82.16% with notable organ-wise gains). The approach demonstrates robust multi-scale segmentation with scalable computation, offering tangible benefits for CAD and treatment planning in clinical workflows.

Abstract

In healthcare, medical image segmentation is crucial for accurate disease diagnosis and the development of effective treatment strategies. Early detection can significantly aid in managing diseases and potentially prevent their progression. Machine learning, particularly deep convolutional neural networks, has emerged as a promising approach to addressing segmentation challenges. Traditional methods like U-Net use encoding blocks for local representation modeling and decoding blocks to uncover semantic relationships. However, these models often struggle with multi-scale objects exhibiting significant variations in texture and shape, and they frequently fail to capture long-range dependencies in the input data. Transformers designed for sequence-to-sequence predictions have been proposed as alternatives, utilizing global self-attention mechanisms. Yet, they can sometimes lack precise localization due to insufficient granular details. To overcome these limitations, we introduce TransDAE: a novel approach that reimagines the self-attention mechanism to include both spatial and channel-wise associations across the entire feature space, while maintaining computational efficiency. Additionally, TransDAE enhances the skip connection pathway with an inter-scale interaction module, promoting feature reuse and improving localization accuracy. Remarkably, TransDAE outperforms existing state-of-the-art methods on the Synapse multi-organ dataset, even without relying on pre-trained weights.
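
The idea of attending over both spatial and channel dimensions without building a full token-by-token attention map can be sketched in NumPy. This is a rough illustration under stated assumptions, not the authors' implementation: it assumes a linear-complexity "efficient attention" formulation for the spatial path and a transposed (channel-by-channel) attention for the channel path. The function names, the identity query/key/value projections, and the 1/sqrt(n) scaling are all hypothetical choices made for the example; layer normalization, learned projections, and MLP sub-layers are omitted.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def efficient_spatial_attention(q, k, v):
    # q, k: (n, d_k); v: (n, d_v), where n is the number of tokens.
    q_norm = softmax(q, axis=1)   # normalize each query over its channels
    k_norm = softmax(k, axis=0)   # normalize each key channel over the tokens
    context = k_norm.T @ v        # (d_k, d_v) global context, no (n, n) map
    return q_norm @ context       # (n, d_v)

def channel_attention(q, k, v):
    # q, k, v: (n, d). The attention map is (d, d), so cost is linear in n.
    n = q.shape[0]
    attn = softmax((q.T @ k) / np.sqrt(n), axis=1)  # assumed scaling factor
    return v @ attn.T             # each output channel mixes the value channels

def dual_attention_block(x):
    # Assumption: identity projections and plain residual connections.
    x = x + efficient_spatial_attention(x, x, x)
    x = x + channel_attention(x, x, x)
    return x

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 32))   # e.g. 14 x 14 patches, 32 channels
print(dual_attention_block(tokens).shape)
```

Because matrix multiplication is associative, computing `k_norm.T @ v` first gives the same result as forming the (n, n) map and then multiplying by `v`, but the intermediate is only (d_k, d_v). That is where the saving comes from when the number of tokens is much larger than the channel width.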

Paper Structure (14 sections, 4 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Overview of the Proposed Hierarchical Transformer Model. The model combines a U-Net-like structure with efficient dual attention mechanisms to achieve robust medical image segmentation. Starting with an input image $x \in \mathbb{R}^{H\times W \times C}$, the architecture tokenizes the input into overlapping patches. These tokens pass through encoder modules made up of dual Transformer layers and patch merging, enabling multi-scale hierarchical feature representation. During decoding, patch tokens are expanded and integrated with the corresponding encoder features using a large-kernel attention module. This fusion improves communication between the encoder and decoder, and the final projection layer produces the output segmentation map.
  • Figure 2: Visual depiction of the integrated dual attention mechanism. (a) Illustrates the channel attention process, emphasizing efficient channel-specific representations. (b) Portrays the spatial attention component, underscoring its ability to discern contextual dependencies within the image. Combined, these components work harmoniously to refine medical image segmentation by concentrating on both spatial relations and informative channels.
  • Figure 3: Schematic representation of the Inter-scale Interaction Module. This module skillfully integrates the benefits of both convolution and self-attention, circumventing the limitations of each. The module incorporates local context information, expansive receptive fields, linear complexity, and dynamic processes, ensuring adaptability across both spatial and channel dimensions. A central element of the module is the attention map, emphasizing the significance of each feature. The figure delineates the decomposition of large kernel convolution operations, capturing long-distance associations with reduced computational overhead and fewer parameters, a pivotal innovation of the Inter-scale Interaction Module.
  • Figure 4: Segmentation comparisons on the Synapse dataset reveal that our suggested approach produces more refined and smooth borders for the stomach, spleen, and liver organs while also displaying fewer false positive prediction masks for the gallbladder in comparison to Swin-Unet and HiFormer. In the bottom row, the proposed method additionally demonstrates a reduced false positive area for the pancreas.
  • Figure 5: Visual representation of the attention map for the proposed model using Grad-CAM (Selvaraju et al., 2017) on the Synapse dataset. The results illustrate the effectiveness of our approach in identifying large organs (liver, spleen, and stomach, arranged from top to bottom), which demonstrates our method's proficiency in capturing long-range dependencies.
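
The parameter saving that the Figure 3 caption attributes to decomposing large-kernel convolutions can be illustrated with simple arithmetic. The sketch below assumes the decomposition commonly used in large-kernel attention designs: a (2d-1) x (2d-1) depthwise convolution, a ceil(K/d) x ceil(K/d) depthwise convolution with dilation d, and a 1 x 1 pointwise convolution. The channel count, K = 21, and d = 3 are hypothetical values chosen for the example; the captions do not state which settings the Inter-scale Interaction Module uses. Bias terms are ignored.

```python
import math

def standard_conv_params(channels, kernel):
    # Dense K x K convolution mapping `channels` inputs to `channels` outputs.
    return channels * channels * kernel * kernel

def decomposed_params(channels, kernel, dilation):
    # Assumed three-part decomposition of a K x K convolution.
    local_k = 2 * dilation - 1                 # depthwise, captures local context
    dilated_k = math.ceil(kernel / dilation)   # depthwise dilated, long range
    local = local_k * local_k * channels
    dilated = dilated_k * dilated_k * channels
    pointwise = channels * channels            # 1 x 1 conv mixes channels
    return local + dilated + pointwise

channels, kernel, dilation = 64, 21, 3         # hypothetical settings
dense = standard_conv_params(channels, kernel)
split = decomposed_params(channels, kernel, dilation)
print(f"dense 21x21 conv: {dense} parameters")
print(f"decomposed:       {split} parameters")
print(f"reduction factor: {dense / split:.1f}x")
```

With these assumed values the two depthwise stages use 5 x 5 and 7 x 7 kernels, so the spatial cost grows with the channel count rather than with its square; only the pointwise stage is quadratic in channels.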