GCtx-UNet: Efficient Network for Medical Image Segmentation
Khaled Alrfou, Tian Zhao
TL;DR
GCtx-UNet is a lightweight UNet-inspired architecture that uses the Global Context Vision Transformer (GC-ViT) to model both long- and short-range spatial dependencies for medical image segmentation. By pairing GC-ViT blocks in an encoder-decoder with CNN-based downsampling and skip connections, the approach achieves competitive or superior Dice Similarity Coefficients and Hausdorff distances while reducing model size and computational cost. Pretraining on MedNet, a medical image corpus, significantly boosts in-domain performance over ImageNet pretraining and improves transfer to medical segmentation benchmarks such as Synapse, ACDC, and several polyp datasets. On Synapse, ACDC, and unseen polyp datasets, GCtx-UNet generalizes well and remains efficient and robust, which makes it a practical option for clinical deployment and suggests future 3D extensions for voxel-level segmentation.
Abstract
Medical image segmentation is crucial for disease diagnosis and monitoring. Though effective, current segmentation networks such as UNet struggle to capture long-range features. More accurate models such as TransUNet, Swin-UNet, and CS-UNet have higher computational complexity. To address this problem, we propose GCtx-UNet, a lightweight segmentation architecture that captures both global and local image features with accuracy better than or comparable to state-of-the-art approaches. GCtx-UNet uses a vision transformer that combines global context self-attention modules with local self-attention to model long- and short-range spatial dependencies. GCtx-UNet is evaluated on the Synapse multi-organ abdominal CT dataset, the ACDC cardiac MRI dataset, and several polyp segmentation datasets. In terms of the Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) metrics, GCtx-UNet outperforms CNN-based and Transformer-based approaches, with notable gains in the segmentation of complex and small anatomical structures. Moreover, GCtx-UNet is much more efficient than state-of-the-art approaches, with a smaller model size, lower computational workload, and faster training and inference, making it a practical choice for clinical applications.
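For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how DSC and HD can be computed for a pair of binary 2D masks. It is an illustration only, not the evaluation code used by the authors: the function names (`dice_coefficient`, `hausdorff_distance`, `foreground_points`) and the toy masks are hypothetical, and the Hausdorff distance here is the plain symmetric form over all foreground pixels. Published evaluations often use variants (for example, a percentile-based distance or one restricted to boundary pixels), which this sketch does not implement.

```python
import math


def dice_coefficient(pred, target):
    """Dice = 2 * |P and T| / (|P| + |T|) for binary masks (nested lists of 0/1)."""
    intersection = 0
    pred_count = 0
    target_count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            pred_count += 1 if p else 0
            target_count += 1 if t else 0
    total = pred_count + target_count
    # Convention (an assumption): two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


def foreground_points(mask):
    """Return (row, col) coordinates of all non-zero pixels."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


def hausdorff_distance(pred, target):
    """Symmetric Hausdorff distance between the foreground pixel sets."""
    a = foreground_points(pred)
    b = foreground_points(target)
    if not a and not b:
        return 0.0
    if not a or not b:
        return math.inf  # undefined when only one mask is empty

    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))


# Toy example: the prediction has one extra foreground pixel.
pred = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
target = [[0, 1, 1, 0],
          [0, 1, 0, 0],
          [0, 0, 0, 0]]

print(dice_coefficient(pred, target))    # 6/7, about 0.857
print(hausdorff_distance(pred, target))  # 1.0
```

A higher DSC indicates better overlap, while a lower HD indicates that the worst-case disagreement between the predicted and reference regions is smaller, which is why the two metrics are usually reported together.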
