
GCtx-UNet: Efficient Network for Medical Image Segmentation

Khaled Alrfou, Tian Zhao

TL;DR

GCtx-UNet presents a lightweight UNet-inspired architecture that harnesses the Global Context Vision Transformer (GC-ViT) to model both long- and short-range spatial dependencies for medical image segmentation. By pairing GC-ViT blocks in an encoder-decoder with CNN-based downsampling and skip connections, the approach achieves competitive or superior Dice Similarity Coefficients and Hausdorff distances while reducing model size and computational cost. Pretraining on MedNet, a medical image corpus, significantly boosts in-domain performance over ImageNet pretraining and improves transfer to medical datasets such as Synapse, ACDC, and polyp segmentation tasks. Across Synapse, ACDC, and unseen polyp datasets, GCtx-UNet demonstrates strong generalization, efficiency, and robustness, making it a practical option for clinical deployment and suggesting future 3D extensions for voxel-level segmentation.

Abstract

Medical image segmentation is crucial for disease diagnosis and monitoring. Though effective, current segmentation networks such as UNet struggle to capture long-range features. More accurate models such as TransUNet, Swin-UNet, and CS-UNet have higher computational complexity. To address this problem, we propose GCtx-UNet, a lightweight segmentation architecture that can capture global and local image features with accuracy better than or comparable to the state-of-the-art approaches. GCtx-UNet uses a vision transformer that combines global context self-attention modules with local self-attention to model long- and short-range spatial dependencies. GCtx-UNet is evaluated on the Synapse multi-organ abdominal CT dataset, the ACDC cardiac MRI dataset, and several polyp segmentation datasets. In terms of Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) metrics, GCtx-UNet outperformed CNN-based and Transformer-based approaches, with notable gains in the segmentation of complex and small anatomical structures. Moreover, GCtx-UNet is much more efficient than the state-of-the-art approaches, with smaller model size, lower computational workload, and faster training and inference speed, making it a practical choice for clinical applications.
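
The two evaluation metrics named in the abstract can be illustrated with a minimal NumPy sketch for binary masks. This is not the authors' evaluation code: the function names `dice_coefficient` and `hausdorff_distance` are hypothetical, the Hausdorff distance here is the plain maximum variant measured in pixel units over all foreground pixels, and the handling of empty masks is an assumption made for illustration. Evaluation pipelines often use a percentile variant instead.

```python
import numpy as np


def dice_coefficient(pred, target):
    """Dice Similarity Coefficient between two binary masks (1.0 = perfect overlap)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / denom


def hausdorff_distance(pred, target):
    """Symmetric Hausdorff distance (in pixels) between the foreground pixel sets."""
    a = np.argwhere(np.asarray(pred, dtype=bool))
    b = np.argwhere(np.asarray(target, dtype=bool))
    if len(a) == 0 or len(b) == 0:
        # Assumption: the distance is undefined for an empty mask; report infinity.
        return float("inf")
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Largest nearest-neighbour distance, taken in both directions.
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```

A higher DSC means better region overlap, while a lower HD means the predicted boundary stays closer to the ground-truth boundary, which is why the two are typically reported together.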

Paper Structure (20 sections, 1 equation, 7 figures, 7 tables)

Figures (7)

  • Figure 1: An illustration of the local and global attention mechanisms in GC-ViT (Hatamizadeh et al., 2023). Local attention is computed on feature patches within a local window only (left). The global attention mechanism extracts query patches from the entire input feature map, aggregating information from all windows. The global query interacts with local key and value tokens, which allows the block to capture long-range information.
  • Figure 2: A GC-ViT block has local and global attention modules, a global token generator, and a downsampling layer.
  • Figure 3: The GCtx-UNet architecture includes an encoder, a bottleneck, skip connections, and a decoder. The encoder, bottleneck, and decoder are all built from GC-ViT blocks.
  • Figure 4: Fused-MBConv module
  • Figure 5: Comparison of GCtx-UNet with the ground truth, CS-UNet, and Swin-UNet on two sample images from the Synapse dataset. Note that GCtx-UNet$^1$ is pre-trained on ImageNet and GCtx-UNet$^2$ is pre-trained on MedNet. The red rectangles mark regions where Swin-UNet tends to over-segment compared to GCtx-UNet and CS-UNet.
  • ...and 2 more figures
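
The difference between the local and global attention described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a simplified illustration under several assumptions, not the GC-ViT implementation: there are no learned query/key/value projections, no multiple heads, and no position bias, and the learned global token generator mentioned in Figure 2 is replaced with plain average pooling. All function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_partition(x, w):
    """Split an (H, W, C) feature map into non-overlapping windows: (num_windows, w*w, C)."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)


def attention(q, k, v):
    """Scaled dot-product attention over the last two axes."""
    scale = q.shape[-1] ** -0.5
    weights = softmax(q @ np.swapaxes(k, -1, -2) * scale)
    return weights @ v


def local_attention(x, w):
    """Queries, keys, and values all come from the same local window."""
    win = window_partition(x, w)
    return attention(win, win, win)


def global_query(x, w):
    """Pool the whole feature map down to w*w tokens.

    Assumption: average pooling stands in for the learned global token generator.
    """
    H, W, C = x.shape
    pooled = x.reshape(w, H // w, w, W // w, C).mean(axis=(1, 3))
    return pooled.reshape(w * w, C)


def global_attention(x, w):
    """One query set from the entire map, shared by every window's keys and values."""
    win = window_partition(x, w)         # (num_windows, w*w, C)
    q = global_query(x, w)               # (w*w, C), summarizes all windows
    return attention(q[None], win, win)  # q broadcasts across windows
```

For an 8x8 feature map with window size 4, both functions return one output of 16 tokens for each of the 4 windows. The only difference is where the queries come from: in `global_attention` every window is queried by the same tokens derived from the whole map, which is how information from outside the window enters the computation.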