D-TrAttUnet: Toward Hybrid CNN-Transformer Architecture for Generic and Subtle Segmentation in Medical Images

Fares Bougourzi, Fadi Dornaika, Cosimo Distante, Abdelmalik Taleb-Ahmed

TL;DR

D-TrAttUnet tackles the challenging problem of generic and subtle medical image segmentation by integrating a Transformer-based encoder with a CNN backbone in a single architecture. It features a hybrid encoder, a four-level Encoders Fusion Module that injects multi-scale Transformer features into CNN representations, and dual decoders with Attention Gates to concurrently segment lesions and organs, trained with a hybrid lesion/organ loss. The method achieves superior performance on Bone Metastasis and Covid-19 infection segmentation and demonstrates strong generalization on gland and nuclei segmentation through the hybrid encoder, with ablations confirming the importance of each component. The approach offers data-efficient, robust segmentation across tasks and provides practical inference efficiency, making it well-suited for clinical and research contexts, with code to be publicly released.

Abstract

Over the past two decades, machine analysis of medical imaging has advanced rapidly, opening up significant potential for several important medical applications. As complicated diseases become more common and the number of cases rises, machine-based imaging analysis has become indispensable. It serves as both a tool and an assistant to medical experts, providing valuable insights and guidance. A particularly challenging task in this area is lesion segmentation, which is difficult even for experienced radiologists. The complexity of this task highlights the urgent need for robust machine learning approaches to support medical staff. In response, we present our novel solution: the D-TrAttUnet architecture. This framework is based on the observation that different diseases often target specific organs. Our architecture has an encoder-decoder structure with a composite Transformer-CNN encoder and dual decoders. The encoder includes two paths: the Transformer path and the Encoders Fusion Module path. The Dual-Decoder configuration uses two identical decoders, each with attention gates. This allows the model to segment lesions and organs simultaneously and to combine their segmentation losses. To validate our approach, we performed evaluations on the Covid-19 and Bone Metastasis segmentation tasks. We also investigated the adaptability of the model by testing it, without the second decoder, on the segmentation of glands and nuclei. The results confirmed the superiority of our approach, especially in the segmentation of Covid-19 infections and bone metastases. In addition, the hybrid encoder showed exceptional performance in the segmentation of glands and nuclei, solidifying its role in modern medical image analysis.
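
The idea of combining the lesion and organ segmentation losses can be illustrated with a small sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: each decoder is taken to output per-pixel probabilities, each loss is taken to be binary cross-entropy, and the two are summed with a hypothetical weight `organ_weight`. The function names are likewise hypothetical.

```python
import math

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over flat lists of probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(pred)

def hybrid_loss(lesion_pred, lesion_gt, organ_pred, organ_gt, organ_weight=1.0):
    """Assumed form: lesion BCE plus a weighted organ BCE (weight is hypothetical)."""
    return bce(lesion_pred, lesion_gt) + organ_weight * bce(organ_pred, organ_gt)

# Toy example: four pixels per mask, one probability list per decoder output.
lesion_pred, lesion_gt = [0.9, 0.2, 0.1, 0.7], [1, 0, 0, 1]
organ_pred, organ_gt = [0.8, 0.8, 0.3, 0.9], [1, 1, 0, 1]
loss = hybrid_loss(lesion_pred, lesion_gt, organ_pred, organ_gt, organ_weight=0.5)
```

Setting `organ_weight=0.0` in this sketch reduces it to a lesion-only loss, which loosely mirrors the single-decoder setting the abstract describes for gland and nuclei segmentation.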

Paper Structure (23 sections, 24 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: The summary of our proposed D-TrAttUnet approach.
  • Figure 2: Detailed Structure of the proposed D-TrAttUnet architecture.
  • Figure 3: Description of ResBlock (ResB), UpResBlock (UpR) and TransformerLayer.
  • Figure 4: Attention Gate block, where $g_i$ is the gating signal and $x_i$ is the input feature maps. $M_{att}(h,w)$ is the resulting spatial attention map, which is applied to all channels of the input feature maps ($x_i$).
  • Figure 5: Visual Comparison of Bone Metastasis Segmentation Models Trained with Different Architectures.
  • ...and 2 more figures
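
The Figure 4 caption describes a gate that turns a gating signal $g_i$ and input feature maps $x_i$ into a single spatial map $M_{att}(h,w)$, which is then applied to every channel of $x_i$. The sketch below shows one common way to build such a gate: additive attention with 1x1 convolutions, in the style of Attention U-Net. This is an assumption, not necessarily the paper's exact block. The weight names `w_x`, `w_g`, and `psi` are hypothetical, biases are omitted, and $g_i$ is assumed to already have the same spatial size as $x_i$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, w_x, w_g, psi):
    """Additive attention gate (sketch).

    x:   input feature maps, shape (C_x, H, W)
    g:   gating signal, shape (C_g, H, W), assumed resampled to match x
    w_x: 1x1-conv weights for x, shape (C_int, C_x)
    w_g: 1x1-conv weights for g, shape (C_int, C_g)
    psi: 1x1-conv weights producing one attention channel, shape (C_int,)
    """
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    theta = np.einsum("ic,chw->ihw", w_x, x)
    phi = np.einsum("ic,chw->ihw", w_g, g)
    q = np.maximum(theta + phi, 0.0)                 # ReLU on the summed projections
    m_att = sigmoid(np.einsum("i,ihw->hw", psi, q))  # spatial map, shape (H, W)
    # The same (H, W) map is broadcast over all channels of x.
    return x * m_att[None, :, :], m_att

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
g = rng.normal(size=(6, 8, 8))
w_x = rng.normal(size=(3, 4))
w_g = rng.normal(size=(3, 6))
psi = rng.normal(size=3)
gated, m_att = attention_gate(x, g, w_x, w_g, psi)
```

Because `m_att` has no channel axis, each spatial location is scaled by the same factor in every channel, matching the caption's statement that the attention is applied to all channels of $x_i$.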