UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation

Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

TL;DR

UNETR++ introduces Efficient Paired Attention (EPA), a lightweight, dual-branch attention mechanism that jointly models spatial and channel dependencies with shared queries/keys to reduce parameters and computation. Built on a four-stage hierarchical encoder–decoder, EPA provides linear-time spatial attention and channel-wise feature coupling, improving segmentation accuracy while cutting compute relative to prior transformer-based 3D segmentation methods. Across five benchmarks, UNETR++ achieves state-of-the-art Dice scores (notably 87.2% on Synapse) with substantial reductions in both parameters and FLOPs, demonstrating practical benefits for large-volume medical imaging. The approach offers robust, efficient 3D segmentation suitable for clinical and research settings, with potential for further gains via targeted geometric data augmentation.

Abstract

Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks. Within transformer models, the self-attention mechanism is one of the main building blocks that strives to capture long-range dependencies. However, the self-attention operation has quadratic complexity, which proves to be a computational bottleneck, especially in volumetric medical imaging, where the inputs are 3D with numerous slices. In this paper, we propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters, compute cost, and inference speed. The core of our design is a novel efficient paired attention (EPA) block that learns spatial and channel-wise discriminative features using a pair of inter-dependent branches based on spatial and channel attention. Our spatial attention formulation is efficient, with linear complexity with respect to the input sequence length. To enable communication between the spatial and channel-focused branches, we share the weights of the query and key mapping functions, which provides a complementary benefit (paired attention) while also reducing the overall network parameters. Our extensive evaluations on five benchmarks, Synapse, BTCV, ACDC, BraTS, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy. On Synapse, our UNETR++ sets a new state-of-the-art with a Dice Score of 87.2%, while being significantly more efficient, with a reduction of over 71% in both parameters and FLOPs compared to the best method in the literature. Code: https://github.com/Amshaker/unetr_plus_plus.
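For readers unfamiliar with the metric behind the reported numbers, the following is a minimal sketch of a per-label Dice score on a 3D label volume. It is not taken from the UNETR++ code base: the function name `dice_score`, the toy volume, and the convention of returning 1.0 when a label is absent from both volumes are assumptions made for illustration, and the benchmark protocol (which organs are averaged, and how) is not specified here.

```python
import numpy as np


def dice_score(pred, target, label):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for one label in two label volumes.

    `pred` and `target` are integer arrays of identical shape.
    Hypothetical helper for illustration only.
    """
    p = pred == label
    t = target == label
    denom = p.sum() + t.sum()
    if denom == 0:
        # Assumed convention: label absent from both volumes counts as a match.
        return 1.0
    return 2.0 * np.logical_and(p, t).sum() / denom


# Toy 2x2x2 volume: the target has 4 voxels of label 1, the prediction
# recovers 3 of them and adds no false positives.
target = np.zeros((2, 2, 2), dtype=int)
target[0] = 1
pred = np.zeros((2, 2, 2), dtype=int)
pred[0, 0, :] = 1
pred[0, 1, 0] = 1

score = dice_score(pred, target, label=1)
print(round(score, 4))  # 2 * 3 / (3 + 4) = 0.8571
```

A multi-organ result is typically summarized by averaging such per-label scores, but the exact averaging used for the 87.2% figure should be checked against the paper.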

Paper Structure

This paper contains 18 sections, 8 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Left: Qualitative comparison between the baseline UNETR and our UNETR++ on Synapse. We present two examples containing multiple organs. Each inaccurately segmented region is marked with a white dashed box. In the first row, UNETR struggles to accurately segment the right kidney (RKid) and confuses it with the gallbladder (Gal). Further, both the stomach (Sto) and left adrenal gland (LAG) tissues are inaccurately segmented. In the second row, UNETR struggles to segment the whole spleen and mixes it with the stomach (Sto) and the portal and splenic veins (PSV). Moreover, it under- and over-segments certain organs (e.g., PSV and Sto). In comparison, our UNETR++, which efficiently encodes enriched inter-dependent spatial and channel features within the proposed EPA block, accurately segments all organs in these examples. Best viewed zoomed in. Additional qualitative comparisons are presented in Fig. \ref{Fig:Qualitative_results} and the supplementary material. Right: Accuracy (Dice score) vs. model complexity (FLOPs and parameters) comparison on Synapse. Compared to the best existing method, nnFormer, UNETR++ achieves better segmentation performance while reducing model complexity by over 71%.
  • Figure 2: Overview of our UNETR++ approach with a hierarchical encoder-decoder structure. The 3D patches are fed to the encoder, whose outputs are then connected to the decoder via skip connections followed by convolutional blocks to produce the final segmentation mask. The focus of our design is the introduction of an efficient paired-attention (EPA) block (Sec. \ref{sec:EPA}). Each EPA block performs two tasks using parallel attention modules with shared keys and queries but different value layers to efficiently learn enriched spatial-channel feature representations. As illustrated in the EPA block diagram (on the right), the first (top) attention module aggregates the spatial features by a weighted sum of the projected features in a linear manner to compute the spatial attention maps, while the second (bottom) attention module emphasizes the dependencies between channels and computes the channel attention maps. Finally, the outputs of the two attention modules are fused and passed to convolutional blocks to enhance the feature representation, leading to better segmentation masks.
  • Figure 3: Qualitative comparison between UNETR++ and the baseline UNETR on Synapse. For better visualization, we enlarge different areas of the images (marked by green dashed boxes). Inaccurate segmentations are marked by red dashed boxes. Compared to the baseline, UNETR++ achieves superior segmentation performance. Best viewed zoomed in.
  • Figure 4: Qualitative comparison on the multi-organ segmentation task. Here, we compare our UNETR++ with existing methods: UNETR, Swin UNETR, and nnFormer. Existing methods struggle to correctly segment different organs (marked by red dashed boxes). Our UNETR++ achieves promising segmentation performance by accurately segmenting the organs. Best viewed zoomed in.
  • Figure 5: Additional qualitative comparison on the Synapse dataset. We compare our UNETR++ with existing methods: UNETR, Swin UNETR, and nnFormer. The existing methods noticeably struggle to correctly segment different organs (marked by red dashed boxes). Our UNETR++ achieves promising segmentation performance by accurately segmenting the organs. Best viewed zoomed in.
  • ...and 5 more figures
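
The EPA block described in the Figure 2 caption (shared queries and keys, separate value layers, a spatial branch that attends over projected features, and a channel branch) can be sketched in NumPy as below. This is a simplified single-head illustration under stated assumptions, not the authors' implementation: the function name `epa_sketch`, the random weight matrices, the projection matrices `e_k` and `e_v`, the `sqrt(c)` scaling, and the plain sum used as a stand-in for fusion are all hypothetical, and the convolutional blocks that follow the fusion are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def epa_sketch(x, w_q, w_k, w_vs, w_vc, e_k, e_v):
    """Simplified paired attention on tokens x of shape (n, c).

    w_q, w_k   : (c, c) query/key weights shared by both branches
    w_vs, w_vc : (c, c) separate value weights (spatial, channel)
    e_k, e_v   : (p, n) projections that shrink n tokens to p << n
    """
    n, c = x.shape
    q, k = x @ w_q, x @ w_k              # shared queries and keys, (n, c)
    v_s, v_c = x @ w_vs, x @ w_vc        # branch-specific values, (n, c)

    # Spatial branch: keys and values are projected to p tokens, so the
    # attention map is (n, p) instead of (n, n), i.e. linear in n for fixed p.
    k_p, v_p = e_k @ k, e_v @ v_s        # (p, c) each
    spatial_attn = softmax(q @ k_p.T / np.sqrt(c), axis=-1)   # (n, p)
    x_s = spatial_attn @ v_p             # (n, c)

    # Channel branch: attention between channels, map of shape (c, c).
    channel_attn = softmax(q.T @ k / np.sqrt(c), axis=-1)     # (c, c)
    x_c = v_c @ channel_attn             # (n, c)

    # Placeholder fusion (assumption): element-wise sum, conv blocks omitted.
    return x_s + x_c, spatial_attn, channel_attn


rng = np.random.default_rng(0)
n, c, p = 512, 16, 32                    # tokens, channels, projected tokens
x = rng.standard_normal((n, c))
w_q, w_k, w_vs, w_vc = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(4))
e_k, e_v = (rng.standard_normal((p, n)) / np.sqrt(n) for _ in range(2))

out, a_s, a_c = epa_sketch(x, w_q, w_k, w_vs, w_vc, e_k, e_v)
print(out.shape, a_s.shape, a_c.shape)   # (512, 16) (512, 32) (16, 16)
```

The point of the sketch is the shapes: the spatial attention map grows as n × p rather than n × n, and a single pair of query/key weight matrices serves both branches, which is where the parameter saving comes from.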