BRAU-Net++: U-Shaped Hybrid CNN-Transformer Network for Medical Image Segmentation

Libin Lan, Pengzhou Cai, Lu Jiang, Xiaojuan Liu, Yongmei Li, Yudong Zhang

TL;DR

This work proposes BRAU-Net++, a hybrid yet effective CNN-Transformer network that restructures the skip connections with convolution-based channel-spatial attention, aiming to minimize local spatial information loss and amplify global dimension-interaction of multi-scale features.

Abstract

Accurate medical image segmentation is essential for clinical quantification, disease diagnosis, treatment planning, and many other applications. Both convolution-based and transformer-based U-shaped architectures have achieved significant success in various medical image segmentation tasks. The former can efficiently learn local information of images, while requiring much more of the image-specific inductive bias inherent to the convolution operation. The latter can effectively capture long-range dependencies at different feature scales using self-attention, but its compute and memory requirements typically grow quadratically with sequence length. To address this problem, we integrate the merits of these two paradigms in a well-designed U-shaped architecture and propose a hybrid yet effective CNN-Transformer network, named BRAU-Net++, for accurate medical image segmentation. Specifically, BRAU-Net++ uses bi-level routing attention as the core building block of a U-shaped encoder-decoder structure in which both the encoder and the decoder are hierarchically constructed, so as to learn global semantic information while reducing computational complexity. Furthermore, the network restructures the skip connections by incorporating convolution-based channel-spatial attention, aiming to minimize local spatial information loss and amplify global dimension-interaction of multi-scale features. Extensive experiments on three public benchmark datasets demonstrate that the proposed approach surpasses other state-of-the-art methods, including its baseline BRAU-Net, under almost all evaluation metrics. We achieve an average Dice-Similarity Coefficient (DSC) of 82.47, 90.10, and 92.94 on Synapse multi-organ segmentation, ISIC-2018 Challenge, and CVC-ClinicDB, respectively, as well as an mIoU of 84.01 and 88.17 on ISIC-2018 Challenge and CVC-ClinicDB, respectively.
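
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how DSC and IoU are commonly computed for a single binary mask. The function name `dice_and_iou`, the flat 0/1 input format, and the convention of scoring two empty masks as 1.0 are assumptions made for illustration; the averaging protocol behind the figures above (per class, per case, per dataset) is not specified in this summary. The values in the abstract appear to be these ratios expressed as percentages.

```python
def dice_and_iou(pred, target):
    """Return (DSC, IoU) for two flat sequences of 0/1 labels.

    DSC = 2|P ∩ T| / (|P| + |T|),  IoU = |P ∩ T| / |P ∪ T|.
    Assumption: two empty masks count as a perfect match (1.0, 1.0).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection

    if union == 0:
        return 1.0, 1.0

    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Hypothetical 4-pixel example: one overlapping pixel, two foreground pixels each.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For a single mask the two metrics are monotonically related by DSC = 2·IoU / (1 + IoU), so they rank individual predictions identically; they can differ once averaged over classes or cases.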

Paper Structure (38 sections, 21 equations, 7 figures, 9 tables, 1 algorithm)

Figures (7)

  • Figure 1: Motivation. Because of the intrinsic locality of the convolution operation and the high computational complexity of the vanilla transformer, we incorporate sparse attention into a U-shaped architecture, which can capture long-range dependencies while reducing computation cost, so as to perform medical image segmentation efficiently. In practice, the main goal of a sparse attention mechanism is to ensure that each query attends only to a few of the most relevant key-value tokens. Since the tokens selected by static sparse attention are query-agnostic, we use a query-aware, dynamic sparse attention mechanism in this work. We also restructure the skip connections with channel-spatial attention, implemented by convolution operations, aiming to amplify global dimension-interaction of multi-scale features.
  • Figure 2: Illustration of region gathering and token-to-token attention. By gathering the key and value tensors in routed regions, only GPU-friendly dense matrix multiplications are performed.
  • Figure 3: Details of a BiFormer block.
  • Figure 4: (a) The architecture of our BRAU-Net++, a U-shaped hybrid CNN-Transformer network that uses a sparse attention mechanism, bi-level routing attention, as the core building block to hierarchically design the encoder-decoder structure. (b) The skip connection channel-spatial attention (SCCSA) module, implemented mainly by convolution operations, which aims to enhance cross-dimension interactions in both the channel and spatial aspects and to compensate for the loss of spatial information caused by down-sampling.
  • Figure 5: Qualitative comparisons of our approach against other state-of-the-art methods on the Synapse multi-organ segmentation dataset. Our BRAU-Net++ yields relatively better segmentation results than the other methods. Best viewed in color with zoom-in.
  • ...and 2 more figures
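
The routing idea described in the captions for Figures 1 and 2 (each query attends only to tokens gathered from a few relevant regions) can be sketched roughly as follows. This is an illustrative, dependency-free approximation and not the authors' implementation: it treats the tokens as a 1D sequence instead of a 2D feature map, uses a single head, and omits learned projections and any other terms the full BiFormer block may include. The function name `bi_level_routing_attention` and the parameters `region_size` and `topk` are assumptions.

```python
import math


def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bi_level_routing_attention(q, k, v, region_size, topk):
    """q, k, v: lists of token vectors. Returns (outputs, routed_regions)."""
    n = len(q)
    assert n % region_size == 0, "sequence must split evenly into regions"
    num_regions = n // region_size
    assert 1 <= topk <= num_regions
    regions = [range(r * region_size, (r + 1) * region_size)
               for r in range(num_regions)]

    # Level 1: region-to-region routing on region-averaged queries and keys.
    q_reg = [mean_vec([q[i] for i in reg]) for reg in regions]
    k_reg = [mean_vec([k[i] for i in reg]) for reg in regions]
    routed = []
    for qr in q_reg:
        affinity = [dot(qr, kr) for kr in k_reg]
        order = sorted(range(num_regions), key=lambda j: affinity[j], reverse=True)
        routed.append(order[:topk])  # query-aware choice of key regions

    # Level 2: token-to-token attention over keys/values gathered from routed regions.
    scale = 1.0 / math.sqrt(len(q[0]))
    out = [None] * n
    for r, reg in enumerate(regions):
        gathered = [i for j in routed[r] for i in regions[j]]
        k_g = [k[i] for i in gathered]
        v_g = [v[i] for i in gathered]
        for t in reg:
            weights = softmax([scale * dot(q[t], key) for key in k_g])
            out[t] = [sum(w * val[d] for w, val in zip(weights, v_g))
                      for d in range(len(v_g[0]))]
    return out, routed


# Hypothetical toy input: 6 tokens of dimension 2, split into 3 regions of 2 tokens.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0], [-0.9, 0.2]]
outputs, routed_regions = bi_level_routing_attention(tokens, tokens, tokens,
                                                     region_size=2, topk=2)
```

In this sketch each query scores only `topk * region_size` gathered keys rather than all `n`, which is where the saving over dense attention comes from; setting `topk` to the number of regions recovers ordinary full attention.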