Hi-End-MAE: Hierarchical encoder-driven masked autoencoders are stronger vision learners for medical image segmentation

Fenghe Tang, Qingsong Yao, Wenxin Ma, Chenxu Wu, Zihang Jiang, S. Kevin Zhou

TL;DR

Hi-End-MAE tackles label scarcity in medical image segmentation by proposing encoder-driven reconstruction and hierarchical dense decoding for ViT-based pre-training. By querying encoder representations through cross-attention and progressively decoding from multiple layers, the method learns richer, layer-aware anatomical representations and reduces decoder reliance. Empirical results across seven downstream datasets (including one-shot and MRI transfer) show state-of-the-art segmentation performance with improved efficiency. The work demonstrates the value of leveraging cross-layer information in masked image modeling for medical imaging and suggests directions for scalable, anatomy-centered SSL.

Abstract

Medical image segmentation remains a formidable challenge due to label scarcity. Pre-training Vision Transformer (ViT) through masked image modeling (MIM) on large-scale unlabeled medical datasets presents a promising solution, providing both computational efficiency and model generalization for various downstream tasks. However, current ViT-based MIM pre-training frameworks predominantly emphasize local aggregation representations in output layers and fail to exploit the rich representations across different ViT layers that better capture fine-grained semantic information needed for more precise medical downstream tasks. To fill the above gap, we hereby present Hierarchical Encoder-driven MAE (Hi-End-MAE), a simple yet effective ViT-based pre-training solution, which centers on two key innovations: (1) Encoder-driven reconstruction, which encourages the encoder to learn more informative features to guide the reconstruction of masked patches; and (2) Hierarchical dense decoding, which implements a hierarchical decoding structure to capture rich representations across different layers. We pre-train Hi-End-MAE on a large-scale dataset of 10K CT scans and evaluate its performance across seven public medical image segmentation benchmarks. Extensive experiments demonstrate that Hi-End-MAE achieves superior transfer learning capabilities across various downstream tasks, revealing the potential of ViT in medical imaging applications. The code is available at: https://github.com/FengheTan9/Hi-End-MAE
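
The two ideas named in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the 75% mask ratio, the three stand-in encoder layers, the projection-free single-head attention, and the order in which layers are queried are all assumptions made for illustration. The decoder holds a full token set (visible tokens plus mask tokens) and, at each stage, queries a different encoder layer through cross-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, memory):
    """Single-head cross-attention without learned projections (a simplification)."""
    d = queries.shape[-1]
    scores = queries @ memory.T / np.sqrt(d)   # (n_tokens, n_visible)
    return softmax(scores, axis=-1) @ memory   # (n_tokens, d)

rng = np.random.default_rng(0)
n_patches, dim, mask_ratio = 16, 8, 0.75       # assumed toy sizes and mask ratio

# Random masking: the encoder only processes the visible patches.
perm = rng.permutation(n_patches)
n_visible = int(n_patches * (1 - mask_ratio))
visible_idx = np.sort(perm[:n_visible])

# Random stand-ins for encoder outputs at several layers (hypothetical: 3 layers).
encoder_layers = [rng.normal(size=(n_visible, dim)) for _ in range(3)]

# Full decoder token set: visible tokens plus a shared mask token.
mask_token = np.zeros(dim)
tokens = np.tile(mask_token, (n_patches, 1))
tokens[visible_idx] = encoder_layers[-1]

# Hierarchical decoding: each decoder stage queries a different encoder layer.
for memory in encoder_layers:
    tokens = tokens + cross_attention(tokens, memory)  # residual update
```

Because the masked positions can only be filled in from what the encoder layers provide, the reconstruction signal depends on the quality of the encoder features rather than on a heavy decoder, which is the intuition behind "encoder-driven" reconstruction.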

Paper Structure

This paper contains 20 sections, 7 equations, 14 figures, 13 tables.

Figures (14)

  • Figure 1: Decoder-driven vs. encoder-driven reconstruction. Conventional MAE is based on (a) decoder-driven reconstruction and Hi-End-MAE is based on (b) encoder-driven reconstruction. Slice-based (first row) and volume-based (second row) attention maps for query patches (red box) on different anatomical structures in the last layer of ViT, pre-trained by MAE and Hi-End-MAE, are visualized. The attention maps of MAE tend to attend to limited local contexts, while those of Hi-End-MAE tend to cover more complete anatomical contexts, which are more useful for medical image segmentation.
  • Figure 2: Performance comparisons against well-known medical self-supervised learning methods using different pre-training data scales. The top figure shows results after fine-tuning with 10% of the data, while the bottom figure shows fine-tuning with only a single 3D volume (one-shot).
  • Figure 3: The overall framework of Hi-End-MAE. The Encoder-driven Dense Decoding architecture uses encoder representations to guide the decoder's bottom-up dense reconstruction. The encoder (blue) is a Vision Transformer (ViT) that processes only the visible patches (blue cube). The decoder (green) incorporates a cross-attention mechanism, feeding in a full set of tokens, i.e., visible tokens (grey cube) and learnable mask tokens (mosaic cube), to query the encoder representations (blue arrow) for encoder-driven reconstruction.
  • Figure 4: Comparative analysis of one-shot segmentation results across 16 targets in terms of DSC (%). The abbreviations Lun, Liv, Kid, Vein, Aor, Int, Spl, IVC, Sto, Duo, Pan, Pro/ute, ADR, Rec, Gall and Eso correspond to Lung, Liver, Kidney, Veins, Aorta, Intestine, Spleen, Inferior Vena Cava, Stomach, Duodenum, Pancreas, Prostate/Uterus, Adrenal Gland, Rectum, Gallbladder and Esophagus, respectively.
  • Figure 5: Visualization of attention maps in the 3rd, 6th, 9th, and 12th layers of ViT-B/12$^{(1536)}$ for query patches (red box) on different organs, pre-trained by MAE and Hi-End-MAE. The attention maps correspond to the same attention head in both the MAE and Hi-End-MAE encoders.
  • ...and 9 more figures
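
Figure 4 reports results as DSC (%), the Dice similarity coefficient, defined as 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. The sketch below computes it for flat binary masks; the function name `dice_score`, the toy masks, and the handling of two empty masks are illustrative assumptions, not taken from the paper's evaluation code.

```python
def dice_score(pred, target):
    """Dice similarity coefficient (in %) between two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Convention assumed here: two empty masks count as perfect agreement.
        return 100.0
    return 100.0 * 2 * intersection / total

# Toy example: 3 predicted voxels, 3 reference voxels, 2 in common.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3) = 0.667, i.e. about 66.67%
```

In multi-organ settings such as the 16 targets of Figure 4, a score like this would typically be computed per target and then averaged.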