Multi-Modal Brain Tumor Segmentation via 3D Multi-Scale Self-attention and Cross-attention
Yonghao Huang, Leiting Chen, Chuan Zhou
TL;DR
This work tackles automatic segmentation of brain tumors from multi-modal 3D MRI by combining CNNs and Transformers in a single encoder-decoder architecture. It introduces two modules, a 3D Multi-Scale Self-Attention module (TMSM) and a 3D Multi-Scale Cross-Attention module (TMCM), which capture long-range dependencies and fuse features across scales between the encoding and decoding stages, and it trains the network with a deep supervision strategy. The resulting model, TMA-TransBTS, achieves state-of-the-art performance on the BraTS 2018–2020 datasets with a moderate model size, supporting the value of multi-scale attention for 3D multi-modal medical segmentation. The authors point to lightweight, more efficient attention mechanisms as a direction for future work.
Abstract
Following the success of CNN-based and Transformer-based models in a range of computer vision tasks, recent works have studied the applicability of CNN-Transformer hybrid architectures to 3D multi-modality medical segmentation. Through the self-attention mechanism, the Transformer gives hybrid models the ability to model long-range dependencies in 3D medical images. However, these models usually apply a fixed receptive field to the 3D volumetric features within each self-attention layer, ignoring the multi-scale nature of volumetric lesion features. To address this issue, we propose TMA-TransBTS, a CNN-Transformer hybrid 3D medical image segmentation model based on an encoder-decoder structure. TMA-TransBTS extracts multi-scale 3D features and models long-distance dependencies simultaneously by dividing and aggregating 3D tokens at multiple scales within a self-attention layer. Furthermore, TMA-TransBTS introduces a 3D multi-scale cross-attention module that links the encoder and the decoder, extracting rich volumetric representations by combining the mutual attention mechanism of cross-attention with multi-scale aggregation of 3D tokens. Extensive experiments on three public 3D medical segmentation datasets show that TMA-TransBTS achieves higher average segmentation results than previous state-of-the-art CNN-based 3D methods and CNN-Transformer hybrid 3D methods for the segmentation of 3D multi-modality brain tumors.
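To make the idea of multi-scale token aggregation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that "multi-scale" tokens can be approximated by average-pooling a feature volume with several cubic window sizes and concatenating the resulting tokens as keys and values. The function names, the pooling choice, the scale set (1, 2, 4), and the omission of learned query/key/value projections and multi-head splitting are all assumptions made for illustration.

```python
import numpy as np

def avg_pool3d(x, k):
    """Non-overlapping average pooling of a (C, D, H, W) volume with a cubic window of size k."""
    c, d, h, w = x.shape
    assert d % k == 0 and h % k == 0 and w % k == 0, "spatial dims must be divisible by k"
    x = x.reshape(c, d // k, k, h // k, k, w // k, k)
    return x.mean(axis=(2, 4, 6))

def to_tokens(x):
    """Flatten a (C, D, H, W) volume into (N, C) tokens, one token per voxel."""
    return x.reshape(x.shape[0], -1).T

def multi_scale_tokens(x, scales=(1, 2, 4)):
    """Assumed aggregation: pool at each scale, then concatenate all tokens along N."""
    return np.concatenate([to_tokens(avg_pool3d(x, k)) for k in scales], axis=0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Plain scaled dot-product attention (no learned projections in this sketch)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def multi_scale_self_attention(x, scales=(1, 2, 4)):
    """Queries are full-resolution tokens; keys/values mix tokens from several scales of the same volume."""
    kv = multi_scale_tokens(x, scales)
    return attention(to_tokens(x), kv, kv)

def multi_scale_cross_attention(dec, enc, scales=(1, 2, 4)):
    """Queries come from decoder features; keys/values come from multi-scale encoder tokens.
    The two volumes must share the channel count C; their spatial sizes may differ."""
    kv = multi_scale_tokens(enc, scales)
    return attention(to_tokens(dec), kv, kv)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    enc = rng.standard_normal((8, 4, 4, 4))  # hypothetical encoder features (C=8)
    dec = rng.standard_normal((8, 4, 4, 4))  # hypothetical decoder features (C=8)
    print(multi_scale_self_attention(enc).shape)
    print(multi_scale_cross_attention(dec, enc).shape)
```

In this sketch each query voxel attends to fine tokens and to coarser pooled tokens in the same layer, so one attention operation sees several receptive-field sizes at once. A trained model would additionally use learned projections, multiple heads, and normalization, which are left out here to keep the example short.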
