Y-CA-Net: A Convolutional Attention Based Network for Volumetric Medical Image Segmentation

Muhammad Hamza Sharif, Muzammal Naseer, Mohammad Yaqub, Min Xu, Mohsen Guizani

TL;DR

The authors argue that Y-CA-Net, a versatile generic architecture built from any two encoder backbones and a decoder backbone, achieves superior results for volumetric segmentation by exploiting the complementary strengths of convolution and attention mechanisms.

Abstract

Recent attention-based volumetric segmentation (VS) methods, which focus on modeling long-range dependencies, have achieved remarkable performance in the medical domain. However, for voxel-wise prediction tasks, discriminative local features are key to the performance of VS models, and attention-based VS methods lack them. To resolve this issue, we deliberately pair a convolutional encoder branch with a transformer backbone to extract local and global features in parallel, and aggregate them in a Cross Feature Mixer Module (CFMM) for better prediction of the segmentation mask. We observe that the derived model, Y-CT-Net, achieves competitive performance on multiple medical segmentation tasks. For example, on multi-organ segmentation, Y-CT-Net achieves an 82.4% Dice score, surpassing the well-tuned VS Transformer and CNN-like baselines UNETR and ResNet-3D by 2.9% and 1.4%, respectively. Building on the success of Y-CT-Net, we extend the concept to hybrid attention models, deriving the Y-CH-Net model, which brings a 3% improvement in HD95 score on the same segmentation task. The effectiveness of both Y-CT-Net and Y-CH-Net verifies our hypothesis and motivates us to propose Y-CA-Net, a versatile generic architecture based on any two encoder backbones and a decoder backbone, to fully exploit the complementary strengths of convolution and attention mechanisms. Based on experimental results, we argue that Y-CA-Net is a key player in achieving superior results for volumetric segmentation.
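
The abstract reports results as a Dice score. As a reminder of what that number measures, here is a minimal sketch of the standard per-label Dice overlap, 2|A ∩ B| / (|A| + |B|), averaged over foreground labels. The function names, the toy label lists, and the convention of returning 1.0 when a label is absent from both masks are assumptions made for illustration; they are not taken from the paper's evaluation code.

```python
def dice_score(pred, target, label):
    """Dice overlap for one label over two flattened voxel label lists."""
    a = [p == label for p in pred]
    b = [t == label for t in target]
    intersection = sum(x and y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    # Assumed convention: a label absent from both masks counts as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_dice(pred, target, labels):
    """Average Dice over the given foreground labels (background excluded by the caller)."""
    return sum(dice_score(pred, target, label) for label in labels) / len(labels)


# Toy flattened volumes with labels 0 (background), 1 and 2 (hypothetical organs).
pred = [0, 1, 1, 2, 2, 2, 0, 1]
target = [0, 1, 1, 1, 2, 2, 0, 0]
print(f"mean Dice: {100 * mean_dice(pred, target, labels=[1, 2]):.1f}%")
```

A higher Dice score means more overlap between the predicted and reference masks. HD95, the other metric quoted above, is a boundary distance, so lower values are better.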

Paper Structure (20 sections, 7 equations, 10 figures, 8 tables, 1 algorithm)

Figures (10)

  • Figure 1: Y-CA-Net and a performance comparison of a Y-CA-Net based model with standalone convolutional and attention-based models on the Synapse dataset. As shown in (a), we introduce Y-CA-Net as a general architecture built on two encoder branches (convolutional and attention), a Cross Feature Mixer Module (CFMM), and a decoder. Y-CA-Net is a hybrid architecture that leverages the strengths of both convolution and attention mechanisms, combining their features in the CFMM before passing them to the decoder. We present two variants of our Y-shaped network, Y-CT-Net and Y-CH-Net, with different encoder and decoder backbones. We argue that for voxel-wise prediction tasks, preserving the local structure between neighboring voxels is important, and the attention mechanism misses it. To address this issue, we employ two encoders: a local encoder that relies solely on convolution-based operations and a global encoder that employs attention-based mechanisms. The features extracted from both encoders are then fused in the CFMM. Remarkably, the resulting model, Y-CT-Net, outperforms the well-tuned standalone vision Transformer baseline (UNETR) and the CNN baseline, as shown in (b), which supports the claim that Y-CA-Net provides competitive performance for volumetric segmentation.
  • Figure 2: Visual comparison of the transformer and ConvNet baselines with our Y-CT-Net on a multi-organ segmentation task. Distinct differences are highlighted by small square boxes.
  • Figure 3: An overview of the Y-CT-Net architecture, designed for volumetric medical image segmentation, which takes a 3D input volume and passes it through two encoder branches. The local encoder branch consists of a ResNet3D, which learns local features using 3D convolution operations. The global encoder branch uses a vision transformer backbone to extract global features using a multi-head self-attention mechanism. Two stages of ResNet3D (Stage 2 and Stage 3) are used, with ResNet blocks of [4, 16]. The features extracted by the local and global branches are fed into the Cross Feature Mixer Module (CFMM), which integrates them to capture both local and global context. The outputs of the CFMM and the bottleneck module are then fed to a CNN decoder via skip connections at multiple resolutions to predict the segmentation mask.
  • Figure 4: Qualitative results on multi-organ segmentation. Visual results demonstrate that our method not only predicts accurately but also preserves organ information that is missed by all three attention-based methods. For example, our Y-CH-Net performs better than nnFormer, as the boundary of the stomach is correctly predicted.
  • Figure 5: Qualitative results on brain tumor segmentation
  • ...and 5 more figures
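
The Y-shaped data flow described in the Figure 1 and Figure 3 captions (a convolutional local encoder and an attention-based global encoder running in parallel, a feature mixer, then a decoder) can be illustrated with a toy NumPy sketch. Everything concrete here is an assumption made for illustration: the captions do not specify the CFMM's internals, so the mixer below simply concatenates channels and applies a linear projection; a fixed 3x3x3 mean filter stands in for ResNet3D; one single-head self-attention layer stands in for the transformer backbone; and the decoder is a single-resolution upsampling classifier without the multi-resolution skip connections. Weights are random, so the predicted mask is meaningless and only the shapes and wiring matter.

```python
import numpy as np

rng = np.random.default_rng(0)
P, C, K = 4, 8, 3  # patch size, feature channels, classes (illustrative values)


def patchify(vol, p):
    """Split a (D, H, W) volume into non-overlapping p^3 patches, one row per patch."""
    d, h, w = (s // p for s in vol.shape)
    x = vol.reshape(d, p, h, p, w, p).transpose(0, 2, 4, 1, 3, 5)
    return x.reshape(d * h * w, p ** 3), (d, h, w)


def local_encoder(vol, w_proj):
    """Convolutional stand-in: 3x3x3 mean filter (a fixed conv kernel), then patch projection."""
    D, H, W = vol.shape
    padded = np.pad(vol, 1, mode="edge")
    smooth = sum(
        padded[i:i + D, j:j + H, k:k + W]
        for i in range(3) for j in range(3) for k in range(3)
    ) / 27.0
    patches, grid = patchify(smooth, P)
    return np.tanh(patches @ w_proj), grid  # (N, C) local features


def global_encoder(vol, w_embed, w_q, w_k, w_v):
    """Attention stand-in: patch embedding followed by single-head self-attention."""
    patches, _ = patchify(vol, P)
    tokens = patches @ w_embed  # (N, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(C)
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return tokens + attn @ v  # residual connection, (N, C) global features


def mix_features(local, glob, w_mix):
    """Hypothetical mixer (not the paper's CFMM): concatenate channels, project back to C."""
    return np.concatenate([local, glob], axis=1) @ w_mix


def decode(fused, grid, w_cls):
    """Toy decoder: per-token class logits, nearest-neighbour upsampling, argmax."""
    logits = (fused @ w_cls).reshape(*grid, K)
    for axis in range(3):
        logits = np.repeat(logits, P, axis=axis)
    return logits.argmax(axis=-1)


def rand(*shape):
    """Small random weights standing in for trained parameters."""
    return rng.normal(scale=0.1, size=shape)


vol = rng.normal(size=(8, 8, 8))  # toy 3D input volume
local, grid = local_encoder(vol, rand(P ** 3, C))
glob = global_encoder(vol, rand(P ** 3, C), rand(C, C), rand(C, C), rand(C, C))
fused = mix_features(local, glob, rand(2 * C, C))
mask = decode(fused, grid, rand(C, K))
print(local.shape, glob.shape, fused.shape, mask.shape)
```

The point of the sketch is the wiring: both branches see the same input volume and produce features on the same token grid, which is what lets a mixer combine them position by position before decoding.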