Primus: Enforcing Attention Usage for 3D Medical Image Segmentation
Tassilo Wald, Saikat Roy, Fabian Isensee, Constantin Ulrich, Sebastian Ziegler, Dasha Trofimova, Raphael Stock, Michael Baumgartner, Gregor Köhler, Klaus Maier-Hein
TL;DR
This work investigates why Transformer-based architectures underperform CNNs in 3D medical image segmentation and introduces Primus, the first pure Transformer model for this domain. By systematically deconstructing nine hybrid architectures, the authors show that non-Transformer parameters and architectural choices often drive performance, which limits how much the Transformer itself contributes. Primus mitigates these issues with high-resolution $8^3$ voxel tokens, 3D Rotary Positional Embeddings, SwiGLU MLP blocks, LayerScale, and a lightweight decoder, so that learning happens mainly in the attention blocks and convolution is kept to a minimum. Across multiple public datasets, Primus achieves results competitive with CNN baselines and outperforms several Transformer hybrids, marking a significant step toward Transformer-dominated 3D medical image segmentation and opening avenues for multi-modal integration and self-supervised pre-training.
Abstract
Transformers have achieved remarkable success across multiple fields, yet their impact on 3D medical image segmentation remains limited, with convolutional networks still dominating major benchmarks. In this work, we a) analyze current Transformer-based segmentation models and identify critical shortcomings, particularly their over-reliance on convolutional blocks. Further, we demonstrate that in some architectures, performance is unaffected by the absence of the Transformer, underscoring the limited effectiveness of these Transformer blocks. To address these challenges, we move away from hybrid architectures and b) introduce a fully Transformer-based segmentation architecture, termed Primus. Primus combines high-resolution tokens with advances in positional embeddings and block design to make maximal use of its Transformer blocks. Through these adaptations, Primus surpasses current Transformer-based methods and competes with state-of-the-art convolutional models on multiple public datasets. By doing so, we create the first pure Transformer architecture and take a significant step towards making Transformers state-of-the-art for 3D medical image segmentation.
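To give a rough sense of what "high-resolution tokens" implies for sequence length, the following minimal sketch counts the tokens produced by non-overlapping cubic patches. The $8^3$ patch size comes from the TL;DR; the $128^3$ input crop, the coarser $16^3$ comparison patch, and the helper names `token_grid` and `sequence_length` are assumptions chosen for illustration and are not taken from the paper.

```python
from math import prod


def token_grid(input_shape, patch_size):
    """Per-axis token counts for non-overlapping cubic patches."""
    if any(s % patch_size for s in input_shape):
        raise ValueError("each input dimension must be divisible by the patch size")
    return tuple(s // patch_size for s in input_shape)


def sequence_length(input_shape, patch_size):
    """Total number of tokens the Transformer would attend over."""
    return prod(token_grid(input_shape, patch_size))


# Hypothetical 3D crop size (an assumption, not a value from the source).
crop = (128, 128, 128)

fine = sequence_length(crop, 8)     # 8^3 voxel tokens, as in the TL;DR
coarse = sequence_length(crop, 16)  # coarser 16^3 patches, for comparison only

print(fine, coarse, fine // coarse)
```

Under these assumptions, halving the patch edge length multiplies the token count by 8. Because standard self-attention compares every token pair, the number of pairwise interactions then grows by a factor of 64, which is the cost side of the finer tokenization.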
