Shaken or Stirred? An Analysis of MetaFormer's Token Mixing for Medical Imaging

Ron Keuth, Paul Kaftan, Mattias P. Heinrich

TL;DR

This paper systematically compares six token mixers within the MetaFormer backbone for medical imaging across eight datasets, addressing both classification (global prediction) and segmentation (dense prediction). It finds that for classification, low-complexity mixers such as grouped convolution and pooling generally suffice, and pretrained weights offer limited but dataset-dependent gains; conversely, segmentation benefits from a strong local inductive bias, with grouped convolution providing the best trade-off between accuracy, runtime, and parameters. The study highlights that transfer learning from natural images can help, especially when the token mixer remains similar to the pretrained one, but the domain shift often limits gains in medical data. Overall, the results provide practical guidance: use grouped convolution or pooling for classification and grouped convolution with larger kernels for segmentation, while acknowledging training instabilities in some attention-based mixers and the potential of future 3D extensions.

Abstract

The generalization of the Transformer architecture via MetaFormer has reshaped our understanding of its success in computer vision. By replacing self-attention with simpler token mixers, MetaFormer provides strong baselines for vision tasks. However, while extensively studied on natural image datasets, its use in medical imaging remains scarce, and existing works rarely compare different token mixers, potentially overlooking more suitable design choices. In this work, we present the first comprehensive study of token mixers for medical imaging. We systematically analyze pooling-, convolution-, and attention-based token mixers within the MetaFormer architecture on image classification (global prediction task) and semantic segmentation (dense prediction task). Our evaluation spans eight datasets covering diverse modalities and common challenges in the medical domain. Given the prevalence of pretraining from natural images to mitigate medical data scarcity, we also examine transferring pretrained weights to new token mixers. Our results show that, for classification, low-complexity token mixers (e.g., grouped convolution or pooling) are sufficient, aligning with findings on natural images. Pretrained weights remain useful despite the domain gap introduced by the new token mixer. For segmentation, we find that the local inductive bias of convolutional token mixers is essential. Grouped convolutions emerge as the preferred choice, as they reduce runtime and parameter count compared to standard convolutions, while the MetaFormer's channel-MLPs already provide the necessary cross-channel interactions.
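
The parameter argument for grouped convolutions can be made concrete with a small calculation. The sketch below counts the weights of a 2D convolution with the standard formula k · k · (C_in / g) · C_out and compares it with a two-layer channel-MLP. The channel width of 64, the group counts, the kernel sizes, and the MLP expansion ratio of 4 are illustrative assumptions, not values taken from the paper, and the helper names are hypothetical.

```python
def conv2d_weight_count(c_in, c_out, kernel_size, groups=1):
    """Number of weights (bias excluded) in a 2D convolution.

    Each output channel only sees c_in / groups input channels, so the
    weight count shrinks linearly with the number of groups.
    """
    if c_in % groups != 0 or c_out % groups != 0:
        raise ValueError("groups must divide both c_in and c_out")
    return kernel_size * kernel_size * (c_in // groups) * c_out


def channel_mlp_weight_count(channels, expansion=4):
    """Weights (bias excluded) of a two-layer channel-MLP: C -> r*C -> C.

    The expansion ratio r=4 is an assumption for illustration.
    """
    hidden = channels * expansion
    return channels * hidden + hidden * channels


if __name__ == "__main__":
    channels = 64  # hypothetical stage width, not from the paper
    for k in (3, 7):
        # groups=1 is a standard convolution; groups=channels is depthwise.
        for g in (1, 8, channels):
            n = conv2d_weight_count(channels, channels, k, groups=g)
            print(f"kernel={k} groups={g}: {n} token-mixer weights")
    print(f"channel-MLP: {channel_mlp_weight_count(channels)} weights")
```

Under these assumed values, a 3×3 standard convolution has 36,864 weights, while the depthwise case (groups equal to channels) has 576. The channel-MLP's 32,768 weights are unaffected by the number of groups, which is consistent with the abstract's remark that cross-channel interaction is already handled there.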

Paper Structure

This paper contains 32 sections, 7 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Schema of the MetaFormer block illustrating the different token mixer options employed across stages of the MetaFormer architecture (cf. Fig. 2).
  • Figure 2: Visualization of FLOPs for different token mixers and schema of the MetaFormer architecture.
  • Figure 3: AUC for MetaFormer trained from scratch with different token mixers (architecture signature $[T, T, T, T]$). Line plots indicate results with a monotonic change across kernel sizes.
  • Figure 4: Performance of the MetaFormer using pretrained ImageNet weights for its channel-MLPs and different token mixers in the last two stages (architecture signature $[P, P, T, T]$). Line plots indicate results with a monotonic change across kernel sizes.
  • Figure 5: DSC across three medical datasets for semantic segmentation, comparing different token mixers and kernel sizes. Line plots indicate performance with a monotonic change across kernel sizes.