HySparK: Hybrid Sparse Masking for Large Scale Medical Image Pre-Training

Fenghe Tang, Ronghao Xu, Qingsong Yao, Xueming Fu, Quan Quan, Heqin Zhu, Zaiyi Liu, S. Kevin Zhou

TL;DR

This work tackles the challenge of pre-training large-scale medical image models without labels by enabling end-to-end pre-training of a CNN-Transformer hybrid. HySparK introduces bottom-up 3D hybrid masking and uses sparse convolution in the CNN stage with patch-based ViT encoding, plus a hierarchical decoder with skip connections to fuse multi-scale features. Extensive experiments on 13 public 3D CT datasets show HySparK achieves state-of-the-art transfer to BTCV and MSD segmentation tasks, outperforming MAE, SimMIM, SparK, and other baselines. The approach highlights strong multi-scale representations and transferability in medical image analysis, with code released for reproducibility.

Abstract

Generative self-supervised learning exhibits remarkable representation learning capabilities. However, limited attention has been paid to end-to-end pre-training methods for hybrid CNN-Transformer architectures, which can learn strong local and global representations simultaneously. To address this issue, we propose a generative pre-training strategy called Hybrid Sparse masKing (HySparK), based on masked image modeling, and apply it to large-scale pre-training on medical images. First, we apply a bottom-up 3D hybrid masking strategy to the encoder to keep the masking consistent; we then use sparse convolution for the top CNNs and encode unmasked patches for the bottom vision Transformers. Second, we employ a simple hierarchical decoder with skip connections to achieve dense multi-scale feature reconstruction. Third, we implement our pre-training method on a collection of multiple large-scale 3D medical imaging datasets. Extensive experiments indicate that the proposed pre-training strategy transfers robustly to supervised downstream tasks and sheds light on HySparK's promising prospects. The code is available at https://github.com/FengheTan9/HySparK
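
The bottom-up masking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it only shows one plausible reading: draw a random mask at the coarse CNN-Transformer junction, then copy it upward to each finer CNN stage so that every stage masks the same spatial regions. The function names, the 75% mask ratio, the grid size, and the 2x downsampling per stage are all assumptions made for illustration.

```python
import random


def random_keep_mask(grid, mask_ratio, seed=0):
    """Draw a 3D mask at the junction resolution (hypothetical helper).

    True marks a visible (unmasked) cell, False a masked one.
    """
    depth, height, width = grid
    total = depth * height * width
    n_keep = round(total * (1.0 - mask_ratio))
    rng = random.Random(seed)
    kept = set(rng.sample(range(total), n_keep))
    return [[[(z * height + y) * width + x in kept
              for x in range(width)]
             for y in range(height)]
            for z in range(depth)]


def upsample_mask(mask, factor):
    """Nearest-neighbour upsampling: one coarse cell covers factor**3 fine cells."""
    depth, height, width = len(mask), len(mask[0]), len(mask[0][0])
    return [[[mask[z // factor][y // factor][x // factor]
              for x in range(width * factor)]
             for y in range(height * factor)]
            for z in range(depth * factor)]


def bottom_up_masks(junction_mask, num_cnn_stages, factor=2):
    """Map the junction mask to each finer CNN stage (assumed 2x per stage).

    masks[0] is the junction mask; masks[-1] is the finest resolution.
    """
    masks = [junction_mask]
    for _ in range(num_cnn_stages):
        masks.append(upsample_mask(masks[-1], factor))
    return masks


def count_visible(mask):
    """Number of unmasked cells in a 3D mask."""
    return sum(cell for plane in mask for row in plane for cell in row)


# Assumed toy setting: a 2x2x2 junction grid with 75% of the cells masked.
coarse = random_keep_mask((2, 2, 2), mask_ratio=0.75)
masks = bottom_up_masks(coarse, num_cnn_stages=2)
```

In this sketch the cells of `masks[0]` that are `True` correspond to the patches a Transformer stage would encode, while the finer masks mark the positions a sparse (masked) convolution would keep active in the CNN stages. Because every finer mask is a pure copy of the junction mask, the visible regions line up across scales, which is the consistency property the abstract refers to.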

Paper Structure (13 sections, 4 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Hybrid Sparse masKing (HySparK). The hybrid architecture comprises a CNN at the top (yellow) and a Transformer at the bottom (blue). Masking starts at the junction between the CNN and the Transformer and is propagated bottom-up. Initially unmasked patches are shown in white, unmasked patches obtained by bottom-up mapping in green, and masked positions in black.
  • Figure 2: Reconstruction Result by HySparK.
  • Figure 3: Visualization Results on BTCV dataset.
  • Figure 4: Visualization Results on BTCV with 3D volume. (a) Ground Truth (b) HySparK (Ours) (c) MAE (d) SparK (e) SUP (f) SimMIM.
  • Figure 5: Visualization Results on MSD datasets. Row1 - Liver, Row2 - Lung, Row3 - Pancreas, Row4 - Hepatic Vessel, Row5 - Spleen, Row6 - Colon.
  • ...and 2 more figures