WaveFormer: A 3D Transformer with Wavelet-Driven Feature Representation for Efficient Medical Image Segmentation

Md Mahfuz Al Hasan, Mahdi Zaman, Abdul Jawad, Alberto Santamaria-Pang, Ho Hin Lee, Ivan Tarapov, Kyle See, Md Shah Imran, Antika Roy, Yaser Pourmohammadi Fallah, Navid Asadizanjani, Reza Forghani

TL;DR

3D medical image segmentation requires modeling global context while preserving fine details, yet standard transformers are memory-intensive. WaveFormer is a 3D Transformer that uses discrete wavelet transforms (DWT) to split features into low-frequency (LF) components carrying global context and high-frequency (HF) components carrying fine details. Self-attention is performed only on the LF components, and the details are reconstructed via the inverse DWT (IDWT). The contributions are frequency-domain representation learning on the LF components, an efficient frequency-guided decoder based on IDWT, and enhanced local-global context through multi-scale attention with a squeeze-and-excitation bottleneck. On BraTS2023, FLARE2021, and KiTS2023, WaveFormer matches or exceeds state-of-the-art performance while reducing parameters and inference time, highlighting the practicality of wavelet-based transformers for efficient 3D medical imaging.
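To make the low-frequency/high-frequency split concrete, here is a minimal NumPy sketch of a one-level 3D wavelet decomposition and its inverse. It assumes a Haar wavelet, which this summary does not confirm as the paper's choice, and the function names (`haar_axis`, `dwt3d`, `idwt3d`) are hypothetical. It shows only the decomposition and reconstruction, not the attention or decoder layers.

```python
import numpy as np

def haar_axis(x, axis):
    """Split x along one axis into Haar low-pass and high-pass halves."""
    x = np.moveaxis(x, axis, 0)
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)
    return np.moveaxis(lo, 0, axis), np.moveaxis(hi, 0, axis)

def ihaar_axis(lo, hi, axis):
    """Invert haar_axis: interleave the reconstructed even and odd samples."""
    lo = np.moveaxis(lo, axis, 0)
    hi = np.moveaxis(hi, axis, 0)
    out = np.empty((2 * lo.shape[0],) + lo.shape[1:], dtype=lo.dtype)
    out[0::2] = (lo + hi) / np.sqrt(2)
    out[1::2] = (lo - hi) / np.sqrt(2)
    return np.moveaxis(out, 0, axis)

def dwt3d(x):
    """One-level 3D Haar DWT. Returns 8 subbands keyed 'LLL', 'LLH', ..., 'HHH'."""
    if any(s % 2 for s in x.shape):
        raise ValueError("all three dimensions must be even")
    bands = {"": x}
    for axis in range(3):
        nxt = {}
        for key, band in bands.items():
            lo, hi = haar_axis(band, axis)
            nxt[key + "L"] = lo
            nxt[key + "H"] = hi
        bands = nxt
    return bands

def idwt3d(bands):
    """Inverse of dwt3d: undo the axes in reverse order (2, 1, 0)."""
    for axis in (2, 1, 0):
        prefixes = {key[:-1] for key in bands}
        bands = {p: ihaar_axis(bands[p + "L"], bands[p + "H"], axis) for p in prefixes}
    return bands[""]

rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 8, 8))   # toy feature volume (assumed size)
bands = dwt3d(volume)
lf = bands["LLL"]                          # LF approximation: 1/8 of the voxels
hf = {k: v for k, v in bands.items() if k != "LLL"}  # 7 HF detail subbands
restored = idwt3d(bands)                   # lossless reconstruction
```

In this sketch the `LLL` band holds one eighth of the original voxels, so attention restricted to it operates on far fewer tokens, while the seven detail bands retain what is needed for exact reconstruction.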

Abstract

Transformer-based architectures have advanced medical image analysis by effectively modeling long-range dependencies, yet they often struggle in 3D settings due to substantial memory overhead and insufficient capture of fine-grained local features. We address these limitations with WaveFormer, a novel 3D-transformer that: i) leverages the fundamental frequency-domain properties of features for contextual representation, and ii) is inspired by the top-down mechanism of the human visual recognition system, making it a biologically motivated architecture. By employing discrete wavelet transformations (DWT) at multiple scales, WaveFormer preserves both global context and high-frequency details while replacing heavy upsampling layers with efficient wavelet-based summarization and reconstruction. This significantly reduces the number of parameters, which is critical for real-world deployment where computational resources and training times are constrained. Furthermore, the model is generic and easily adaptable to diverse applications. Evaluations on BraTS2023, FLARE2021, and KiTS2023 demonstrate performance on par with state-of-the-art methods while offering substantially lower computational complexity.
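As a rough illustration of why replacing learned upsampling with wavelet reconstruction can cut parameters, the sketch below counts the weights of a dense 3D transposed convolution and compares them with an inverse DWT that uses fixed wavelet filters. The channel widths are hypothetical and are not taken from the paper, and a real decoder may contain other learnable layers that this comparison ignores.

```python
def conv_transpose3d_params(c_in, c_out, kernel=2, bias=True):
    """Learnable parameters in a dense (groups=1) 3D transposed convolution."""
    return c_in * c_out * kernel ** 3 + (c_out if bias else 0)

def idwt_upsample_params():
    """An inverse DWT with fixed wavelet filters has no learnable weights."""
    return 0

# Hypothetical decoder channel widths, chosen only for illustration.
stages = [(384, 192), (192, 96), (96, 48)]

learned = sum(conv_transpose3d_params(c_in, c_out) for c_in, c_out in stages)
wavelet = sum(idwt_upsample_params() for _ in stages)

print(f"transposed-conv upsampling: {learned:,} parameters")
print(f"fixed-filter IDWT upsampling: {wavelet:,} parameters")
```

Under these assumed widths, the three learned upsampling layers alone account for several hundred thousand parameters, which the fixed-filter reconstruction does not need.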

Paper Structure

This paper contains 11 sections, 1 equation, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The overall architecture of the proposed WaveFormer. The block details are provided in Figure 2.
  • Figure 2: (a) Block architecture of the proposed network for the $i$-th encoder stage. Attention is computed solely on the approximation (low-frequency) coefficients obtained from multi-level DWT. The high-frequency (HF) components extracted at each attention layer are combined and passed to the decoder along with the final stage output. (b) Multi-scale attention design for encoder Stage 1, where attention is computed at each resolution level of the DWT. A fixed window size matching the lowest resolution (third DWT level) enables both global and local context capture in a single attention layer.
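
The fixed-window idea in Figure 2(b) can be illustrated with simple arithmetic. The sketch below assumes a cubic feature map, non-overlapping windows, and a window edge equal to the edge of the third-level approximation; the 32-voxel edge and the helper name `multiscale_window_plan` are hypothetical. It only counts tokens and windows and does not implement attention.

```python
def multiscale_window_plan(side, levels=3):
    """Per DWT level: edge length, token count, and number of fixed-size windows.

    The window edge equals the edge of the coarsest (last-level) approximation,
    so the coarsest level is covered by a single window (global context) while
    finer levels are tiled by several windows (local context).
    """
    if side % (2 ** levels) != 0:
        raise ValueError("side must be divisible by 2**levels")
    window = side // (2 ** levels)
    plan = []
    for level in range(1, levels + 1):
        edge = side // (2 ** level)      # each DWT level halves every axis
        plan.append({
            "level": level,
            "edge": edge,
            "tokens": edge ** 3,
            "windows": (edge // window) ** 3,
            "tokens_per_window": window ** 3,
        })
    return plan

for row in multiscale_window_plan(side=32, levels=3):
    print(row)
```

With these assumed sizes, every window holds the same number of tokens at every level: the third level fits in one window, so attention there is global, while the first and second levels are split into 64 and 8 windows, so attention there is local.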