Less is More: AMBER-AFNO -- a New Benchmark for Lightweight 3D Medical Image Segmentation

Andrea Dosi, Semanto Mondal, Rajib Chandra Ghosh, Massimo Brescia, Giuseppe Longo

Abstract

We adapt the remote sensing-inspired AMBER model, originally developed for multi-band image segmentation, to the segmentation of 3D medical datacubes. To address the computational bottleneck of volumetric transformers, we propose the AMBER-AFNO architecture, which replaces the multi-head self-attention mechanism with Adaptive Fourier Neural Operators (AFNO). Rather than computing pairwise interactions between tokens in the spatial domain, AFNO mixes tokens globally in the frequency domain and thereby avoids the $\mathcal{O}(N^2)$ attention-weight calculations, where $N$ is the number of tokens. As a result, AMBER-AFNO achieves quasi-linear computational complexity and linear memory scaling. This reduces reliance on dense transformers while preserving the ability to model global context. Because its spectral operations are attention-free, the design has a compact parameterization and a competitive computational cost. We evaluate AMBER-AFNO on three public datasets: ACDC, Synapse, and BraTS. On these datasets, the model achieves state-of-the-art or near-state-of-the-art results in terms of Dice similarity coefficient (DSC) and 95th-percentile Hausdorff distance (HD95). Compared with recent compact CNN and Transformer architectures, our approach yields higher Dice scores while maintaining a compact model size. Overall, our results show that frequency-domain token mixing with AFNO provides a fast and efficient alternative to self-attention mechanisms for 3D medical image segmentation.
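To make the idea of frequency-domain token mixing concrete, the following is a minimal NumPy sketch of an AFNO-style mixing step on a 3D token grid. It is an illustration under stated assumptions, not the authors' implementation: the function name `afno_mix_3d`, the tensor layout `(D, H, W, C)`, the block-diagonal weights `w1` and `w2`, and the shrinkage threshold `lam` are all hypothetical choices, and biases, normalization, and learned training are omitted. The point it shows is that every token influences every other token through the FFT, at $\mathcal{O}(N \log N)$ cost, without ever forming an $N \times N$ attention matrix.

```python
import numpy as np


def softshrink(v, lam):
    """Soft-thresholding: shrinks values toward zero by lam (promotes sparsity)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def afno_mix_3d(x, w1, w2, lam=0.01):
    """AFNO-style global token mixing on a 3D grid (illustrative sketch).

    x      : real array of shape (D, H, W, C), one C-dim token per voxel/patch
    w1, w2 : complex arrays of shape (num_blocks, bs, bs) with num_blocks * bs == C
             (hypothetical block-diagonal channel-mixing weights)
    lam    : soft-shrinkage threshold (assumed value)
    """
    D, H, W, C = x.shape
    num_blocks, bs, _ = w1.shape
    assert num_blocks * bs == C, "channels must split evenly into blocks"

    # 1) Move tokens to the frequency domain along the three spatial axes.
    xf = np.fft.rfftn(x, axes=(0, 1, 2), norm="ortho")  # (D, H, W//2+1, C)
    xf = xf.reshape(D, H, W // 2 + 1, num_blocks, bs)

    # 2) Two-layer block-diagonal MLP on channels, shared by all frequencies.
    h = np.einsum("...kb,kbo->...ko", xf, w1)
    h = np.maximum(h.real, 0.0) + 1j * np.maximum(h.imag, 0.0)  # ReLU on both parts
    h = np.einsum("...kb,kbo->...ko", h, w2)

    # 3) Soft-shrinkage on real and imaginary parts to sparsify the spectrum.
    h = softshrink(h.real, lam) + 1j * softshrink(h.imag, lam)
    h = h.reshape(D, H, W // 2 + 1, C)

    # 4) Back to the spatial domain, plus a residual connection.
    y = np.fft.irfftn(h, s=(D, H, W), axes=(0, 1, 2), norm="ortho")
    return x + y
```

Calling `afno_mix_3d` on a grid of shape `(4, 6, 8, 16)` with weights of shape `(4, 4, 4)` returns a real array of the same shape. Memory grows with the number of tokens $N = D \cdot H \cdot W$ rather than with $N^2$, which is the scaling behavior the abstract describes.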

Paper Structure

This paper contains 17 sections, 10 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: The proposed AMBER-AFNO framework consists of two main modules: a hierarchical Transformer encoder that extracts coarse and fine features, and a lightweight MLP decoder that directly fuses these multi-level features and predicts the semantic segmentation mask. FFN denotes a feed-forward network.
  • Figure 2: Simplified workflow diagram of the AMBER-AFNO architecture.
  • Figure 3: Qualitative comparison of AMBER-AFNO and UNETR++ segmentation predictions on representative ACDC validation samples. In the legend, RV denotes the Right Ventricular Cavity, Myo the Myocardium, and LV the Left Ventricular Cavity.
  • Figure 4: Visual comparison of AMBER-AFNO and UNETR++ predictions on the Synapse dataset. In the legend, Spl denotes the Spleen, RKid the Right Kidney, LKid the Left Kidney, Gal the Gallbladder, Liv the Liver, Sto the Stomach, Aor the Aorta, and Pan the Pancreas.
  • Figure 5: Visual comparison of AMBER-AFNO and UNETR++ predictions on the BraTS dataset. In the legend, WT denotes the Whole Tumor, ET the Enhancing Tumor, and TC the Tumor Core.
  • ...and 1 more figure