WaveFormer: A 3D Transformer with Wavelet-Driven Feature Representation for Efficient Medical Image Segmentation
Md Mahfuz Al Hasan, Mahdi Zaman, Abdul Jawad, Alberto Santamaria-Pang, Ho Hin Lee, Ivan Tarapov, Kyle See, Md Shah Imran, Antika Roy, Yaser Pourmohammadi Fallah, Navid Asadizanjani, Reza Forghani
TL;DR
3D medical image segmentation requires modeling global context while preserving fine details, yet standard transformers are memory-intensive. WaveFormer is a 3D Transformer that uses discrete wavelet transforms (DWT) to split features into a low-frequency (LF) component carrying global context and high-frequency (HF) components carrying fine detail. It performs self-attention on the LF component and restores HF detail through the inverse DWT (IDWT), combined with a multi-scale attention strategy. The contributions are frequency-domain representation learning on the LF component, an efficient frequency-guided decoder built on the IDWT, and enhanced local-global context through multi-scale attention with a squeeze-and-excitation bottleneck. On BraTS2023, FLARE2021, and KiTS2023, WaveFormer matches or exceeds state-of-the-art performance while reducing parameters and inference time, highlighting the practicality of wavelet-based transformers for efficient 3D medical imaging.
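The LF/HF split and IDWT reconstruction described above can be illustrated with a minimal NumPy sketch. It assumes a single-level orthonormal Haar wavelet and even-sized volumes; the wavelet family, the number of levels, and the helper names (`haar_split`, `haar_merge`, `dwt3d`, `idwt3d`) are illustrative assumptions, not details taken from WaveFormer.

```python
import numpy as np

def haar_split(x, axis):
    """One-level orthonormal Haar analysis along one axis (even length assumed)."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    low = (even + odd) / np.sqrt(2.0)   # local averages: coarse context
    high = (even - odd) / np.sqrt(2.0)  # local differences: fine detail
    return low, high

def haar_merge(low, high, axis):
    """Inverse of haar_split along a non-negative axis index."""
    even = (low + high) / np.sqrt(2.0)
    odd = (low - high) / np.sqrt(2.0)
    # Stack as (..., n, 2, ...) and flatten so even/odd samples interleave.
    stacked = np.stack([even, odd], axis=axis + 1)
    shape = list(low.shape)
    shape[axis] *= 2
    return stacked.reshape(shape)

def dwt3d(x):
    """Split a 3D volume into 8 sub-bands keyed 'LLL', 'LLH', ..., 'HHH'."""
    bands = {"": x}
    for axis in range(3):
        nxt = {}
        for key, val in bands.items():
            low, high = haar_split(val, axis)
            nxt[key + "L"] = low
            nxt[key + "H"] = high
        bands = nxt
    return bands

def idwt3d(bands):
    """Rebuild the volume from the 8 sub-bands produced by dwt3d."""
    for axis in reversed(range(3)):
        prefixes = {key[:-1] for key in bands}
        bands = {p: haar_merge(bands[p + "L"], bands[p + "H"], axis)
                 for p in prefixes}
    return bands[""]

# Hypothetical feature volume; real inputs would also carry channel dimensions.
volume = np.random.default_rng(0).standard_normal((8, 8, 8))
subbands = dwt3d(volume)
lf = subbands["LLL"]  # the band on which attention would operate
restored = idwt3d(subbands)
print(lf.shape, np.allclose(restored, volume))
```

In this sketch the `LLL` band has half the resolution along each axis, so it holds one eighth of the voxels, while the seven remaining bands keep the detail needed for exact reconstruction.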
Abstract
Transformer-based architectures have advanced medical image analysis by effectively modeling long-range dependencies, yet they often struggle in 3D settings due to substantial memory overhead and insufficient capture of fine-grained local features. We address these limitations with WaveFormer, a novel 3D transformer that: i) leverages the fundamental frequency-domain properties of features for contextual representation, and ii) is inspired by the top-down mechanism of the human visual recognition system, making it a biologically motivated architecture. By employing discrete wavelet transforms (DWT) at multiple scales, WaveFormer preserves both global context and high-frequency details while replacing heavy upsampling layers with efficient wavelet-based summarization and reconstruction. This significantly reduces the number of parameters, which is critical for real-world deployment where computational resources and training times are constrained. Furthermore, the model is generic and easily adaptable to diverse applications. Evaluations on BraTS2023, FLARE2021, and KiTS2023 demonstrate performance on par with state-of-the-art methods while offering substantially lower computational complexity.
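A rough back-of-the-envelope sketch shows where savings of this kind can come from. It assumes dense self-attention over every voxel token, a DWT that halves each spatial axis per level, and a learned transposed 3D convolution as the baseline upsampler. The function names and example sizes are hypothetical, and the numbers are not results from the paper.

```python
def attention_cost(shape, dwt_levels=0):
    """Return (tokens, pairwise attention scores) for a 3D feature volume.

    Each DWT level is assumed to halve every spatial axis, so the
    low-frequency band keeps 1/8 of the tokens per level.
    """
    factor = 2 ** dwt_levels
    d, h, w = (s // factor for s in shape)
    tokens = d * h * w
    return tokens, tokens * tokens

def transposed_conv3d_params(c_in, c_out, kernel=2, bias=True):
    """Learnable weights in a standard transposed 3D convolution."""
    return c_in * c_out * kernel ** 3 + (c_out if bias else 0)

# Hypothetical 32x32x32 feature volume.
full_tokens, full_pairs = attention_cost((32, 32, 32))
lf_tokens, lf_pairs = attention_cost((32, 32, 32), dwt_levels=1)
print(full_tokens // lf_tokens, full_pairs // lf_pairs)

# Hypothetical 64 -> 32 channel learned upsampling layer.
print(transposed_conv3d_params(64, 32))
```

Under these assumptions, one DWT level cuts the token count by 8x and the number of pairwise attention scores by 64x. An inverse DWT with fixed wavelet filters has no learnable weights, whereas each learned upsampling layer adds parameters in proportion to the product of its input and output channel counts.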
