TransDAE: Dual Attention Mechanism in a Hierarchical Transformer for Efficient Medical Image Segmentation
Bobby Azad, Pourya Adibfar, Kaiqun Fu
TL;DR
Medical image segmentation requires accurate delineation of multi-scale anatomical structures. CNNs excel at local detail but struggle with global context, while pure Transformers capture long-range relations yet can miss fine-grained localization. TransDAE addresses this gap with a hierarchical U-Net–like Transformer that incorporates a Dual Attention Transformer Block for simultaneous spatial and channel modeling and an Inter-Scale Interaction Module to enhance skip-connection fusion; efficient self-attention and spatial-reduction strategies keep the computational cost under control. Key contributions include (1) a dual spatial-channel attention mechanism, (2) an Efficient Attention-based channel pathway, (3) the Inter-Scale Interaction Module, which decomposes large-kernel operations to fuse encoder–decoder features, and (4) state-of-the-art performance on the Synapse multi-organ dataset without pretraining (Dice similarity coefficient, DSC = 82.16%, with notable organ-wise gains). The approach demonstrates robust multi-scale segmentation with scalable computation, offering tangible benefits for CAD and treatment planning in clinical workflows.
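For readers unfamiliar with the DSC figure quoted above, the following is a minimal sketch of how a Dice similarity coefficient is typically computed for label maps. The function names (`dice_coefficient`, `mean_dice`) and the empty-mask convention are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice similarity coefficient 2|A∩B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return float(2.0 * intersection / denom)

def mean_dice(pred_labels, target_labels, class_ids):
    """Average the per-class Dice over the given (hypothetical) organ label ids."""
    pred_labels = np.asarray(pred_labels)
    target_labels = np.asarray(target_labels)
    scores = [dice_coefficient(pred_labels == c, target_labels == c)
              for c in class_ids]
    return float(np.mean(scores))
```

A multi-organ score such as the one reported would then correspond to something like `mean_dice(pred, gt, class_ids)` averaged over test volumes, under the assumption that each organ is encoded as an integer label.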
Abstract
In healthcare, medical image segmentation is crucial for accurate disease diagnosis and the development of effective treatment strategies. Early detection can significantly aid in managing diseases and potentially prevent their progression. Machine learning, particularly deep convolutional neural networks, has emerged as a promising approach to addressing segmentation challenges. Traditional methods like U-Net use encoding blocks for local representation modeling and decoding blocks to uncover semantic relationships. However, these models often struggle with multi-scale objects exhibiting significant variations in texture and shape, and they frequently fail to capture long-range dependencies in the input data. Transformers, originally designed for sequence-to-sequence prediction, have been proposed as alternatives because they rely on global self-attention. Yet they can lack precise localization because they capture too little fine-grained detail. To overcome these limitations, we introduce TransDAE: a novel approach that reimagines the self-attention mechanism to include both spatial and channel-wise associations across the entire feature space, while maintaining computational efficiency. Additionally, TransDAE enhances the skip connection pathway with an inter-scale interaction module, promoting feature reuse and improving localization accuracy. Remarkably, TransDAE outperforms existing state-of-the-art methods on the Synapse multi-organ dataset, even without relying on pre-trained weights.
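To make the idea of combined spatial and channel-wise attention concrete, here is a minimal NumPy sketch under stated assumptions. The spatial path uses the linear-complexity "efficient attention" factorization, and the channel path uses a transposed (channel-by-channel) attention map. The sequential ordering, the residual connections, the fixed `temperature`, the omission of normalization and MLP layers, and all function names are illustrative assumptions rather than the TransDAE implementation.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(q, k, v):
    """Spatial attention in linear time: rho_q(Q) @ (rho_k(K)^T @ V).

    q, k: (n, d_k); v: (n, d_v), where n is the number of tokens.
    """
    q = softmax(q, axis=1)   # normalize each token's query over features
    k = softmax(k, axis=0)   # normalize each key feature over the n tokens
    context = k.T @ v        # (d_k, d_v) global context, never an n x n map
    return q @ context       # (n, d_v)

def channel_attention(q, k, v, temperature=1.0):
    """Channel attention: a (d, d) affinity map mixes the channels of v.

    q, k, v: (n, d). `temperature` is a fixed stand-in for a scale factor.
    """
    attn = softmax((k.T @ q) / temperature, axis=0)  # columns sum to 1
    return v @ attn          # (n, d)

def dual_attention_block(x, params):
    """Apply spatial then channel attention, each with a residual connection.

    x: (n, d); params: six (d, d) projection matrices (hypothetical layout).
    """
    wq_s, wk_s, wv_s, wq_c, wk_c, wv_c = params
    x = x + efficient_attention(x @ wq_s, x @ wk_s, x @ wv_s)
    x = x + channel_attention(x @ wq_c, x @ wk_c, x @ wv_c)
    return x

# Usage: 16 tokens (e.g. a flattened 4x4 feature map) with 8 channels.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
weights = [0.1 * rng.standard_normal((8, 8)) for _ in range(6)]
out = dual_attention_block(tokens, weights)
```

The efficiency argument is visible in the shapes: the spatial path builds a `(d_k, d_v)` context matrix and the channel path a `(d, d)` map, so neither materializes the `(n, n)` attention matrix whose cost grows quadratically with image resolution.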
