Harnessing Lightweight Transformer with Contextual Synergic Enhancement for Efficient 3D Medical Image Segmentation

Xinyu Liu, Zhen Chen, Wuyang Li, Chenxin Li, Yixuan Yuan

Abstract

Transformers have shown remarkable performance in 3D medical image segmentation, but their high computational cost and their need for large amounts of labeled data limit their applicability. To address these challenges, we consider two crucial aspects: model efficiency and data efficiency. For model efficiency, we propose Light-UNETR, a lightweight Transformer. Light-UNETR features a Lightweight Dimension Reductive Attention (LIDR) module, which reduces the spatial and channel dimensions while capturing both global and local features via multi-branch attention. Additionally, we introduce a Compact Gated Linear Unit (CGLU) that selectively controls channel interaction with minimal parameters. For data efficiency, we introduce a Contextual Synergic Enhancement (CSE) learning strategy. CSE first leverages extrinsic contextual information to support the learning of unlabeled data through Attention-Guided Replacement, then applies Spatial Masking Consistency, which uses intrinsic contextual information to enhance spatial context reasoning on unlabeled data. Extensive experiments on various benchmarks demonstrate the superiority of our approach in both performance and efficiency. For example, with only 10% labeled data on the Left Atrial Segmentation dataset, our method surpasses BCP by 1.43% Jaccard while reducing FLOPs by 90.8% and parameters by 85.8%. Code is released at https://github.com/CUHK-AIM-Group/Light-UNETR.
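The abstract does not spell out how LIDR reduces dimensions, so the following is only a rough, hypothetical cost model showing why shrinking both the spatial and the channel dimension pays off. It counts the multiply-accumulates (MACs) in the two matrix products of single-head attention, assuming keys and values are pooled by a factor `spatial_reduction` along each of the three spatial axes and that Q, K and V are projected to `channels // channel_reduction` channels. The function name, the factors and the example sizes are illustrative assumptions, not LIDR's actual settings, and the multi-branch global/local structure is ignored.

```python
def attention_macs(n_tokens, channels, spatial_reduction=1, channel_reduction=1):
    """MACs of Q @ K^T and A @ V in single-head attention (hypothetical model).

    Keys/values are assumed pooled by `spatial_reduction` along each of the
    three spatial axes, leaving n_tokens // spatial_reduction**3 of them, and
    Q/K/V are assumed projected to channels // channel_reduction channels.
    Projection layers and softmax are not counted.
    """
    n_kv = n_tokens // spatial_reduction ** 3
    d = channels // channel_reduction
    qk = n_tokens * n_kv * d  # (n x d) @ (d x n_kv)
    av = n_tokens * n_kv * d  # (n x n_kv) @ (n_kv x d)
    return qk + av


# Hypothetical example: a 96^3 volume tokenized with 4^3 patches -> 24^3 tokens.
full = attention_macs(24 ** 3, 64)
reduced = attention_macs(24 ** 3, 64, spatial_reduction=2, channel_reduction=2)
print(full // reduced)  # 16
```

Pooling by 2 per axis removes a factor of 2³ = 8 from the quadratic token term, and halving the channels removes another factor of 2, so the two products shrink 16-fold in this toy setting.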

Paper Structure

This paper contains 39 sections, 17 equations, 9 figures, 14 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Performance comparison of the proposed CSE-Light-UNETR and CSE-VNet against previous semi-supervised medical image segmentation (SSMIS) methods on the LA dataset [LA]. The circle size represents the number of trainable parameters of each method.
  • Figure 2: Comparison between different methods for SSMIS. (a) Traditional methods [yu2019uncertaintyawaremeanteacher, wu2022mutual, BCP] combine conv-based medical image segmentation models [vnet] with semi-supervised learning techniques such as mean teacher [meanteacher]; they are not only computationally expensive but also perform poorly. (b) Recent methods [luo2022semicnntransformer] jointly train Transformers [swin] with conv-based models, but are limited to 2D data and rely on models pretrained on natural images. (c) Our proposed method applies a single shared lightweight Transformer for SSMIS on 3D medical volumes, achieving better performance at significantly lower computational cost.
  • Figure 3: Illustration of the proposed (a) overall Light-UNETR structure, (b) Light-UNETR block, (c) LIDR module, (d) CGLU module, (e) overlap patch embedding, and (f) Light downsample.
  • Figure 4: Illustration of the CSE semi-supervised learning framework, which contains (a) Attention-Guided Replacement (AGR) and (b) Spatial Masking Consistency (SMC).
  • Figure 5: Qualitative comparison of segmentation results from different methods on the LA dataset with 5% labeled data. From left to right: UA-MT [yu2019uncertaintyawaremeanteacher], SS-Net [wu2022ssnet], CAML [CAML], BCP [BCP], MLRP [su2024mutual_MLRP], our CSE-Light-UNETR, and the ground truth.
  • ...and 4 more figures
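The Spatial Masking Consistency idea in Figure 4(b) can be illustrated with a minimal, dependency-free sketch. Under stated assumptions, it tiles a D x H x W volume into non-overlapping cubes, masks a random fraction of them, and scores how far the prediction on the masked volume drifts from the prediction on the full volume. The cube masking scheme, the zero fill value, the mean-squared-error loss, and the voxel-wise `toy_predict` stand-in are all hypothetical choices, not the paper's implementation; whether the target prediction comes from the same network or a separate teacher is also left open here.

```python
import math
import random


def patch_mask(shape, patch, mask_ratio, seed=0):
    """Flat 0/1 mask (1 = masked) over a D x H x W volume, masked in cubes.

    Hypothetical scheme: the volume is tiled into cubes of side `patch`
    and a random `mask_ratio` fraction of the cubes is masked.
    """
    d, h, w = shape
    if d % patch or h % patch or w % patch:
        raise ValueError("each dimension must be divisible by patch")
    cubes = [(i, j, k)
             for i in range(d // patch)
             for j in range(h // patch)
             for k in range(w // patch)]
    rng = random.Random(seed)
    masked = set(rng.sample(cubes, round(mask_ratio * len(cubes))))
    mask = [0] * (d * h * w)
    for z in range(d):
        for y in range(h):
            for x in range(w):
                if (z // patch, y // patch, x // patch) in masked:
                    mask[(z * h + y) * w + x] = 1  # row-major flat index
    return mask


def apply_mask(volume, mask, fill=0.0):
    """Replace masked voxels with `fill` (assumed zero filling)."""
    return [fill if m else v for v, m in zip(volume, mask)]


def consistency_loss(pred_full, pred_masked):
    """Mean squared difference between two per-voxel probability maps."""
    return sum((a - b) ** 2 for a, b in zip(pred_full, pred_masked)) / len(pred_full)


def toy_predict(volume):
    """Hypothetical voxel-wise stand-in for a segmentation network."""
    return [1.0 / (1.0 + math.exp(-v)) for v in volume]


vol = [0.1 * i for i in range(8 ** 3)]  # toy 8x8x8 volume
mask = patch_mask((8, 8, 8), patch=4, mask_ratio=0.5)
print(sum(mask))  # 256 (4 of 8 cubes, 64 voxels each)
loss = consistency_loss(toy_predict(vol), toy_predict(apply_mask(vol, mask)))
```

Because `toy_predict` looks at each voxel in isolation, it cannot recover the masked cubes and the loss stays positive; a network that reasons over spatial context can reduce this term, which is the behavior the consistency objective is meant to encourage.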