Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision

Yunhe Gao, Yabin Zhang, Chong Wang, Jiaming Liu, Maya Varma, Jean-Benoit Delbrouck, Akshay Chaudhari, Curtis Langlotz

Abstract

Foundation models have transformed vision and language by learning general-purpose representations from large-scale unlabeled data, yet 3D medical imaging lacks analogous approaches. Existing self-supervised methods rely on low-level reconstruction or contrastive objectives that fail to capture the anatomical semantics critical for medical image analysis, limiting transfer to downstream tasks. We present MASS (MAsk-guided Self-Supervised learning), which treats in-context segmentation as the pretext task for learning general-purpose medical imaging representations. MASS's key insight is that automatically generated class-agnostic masks provide sufficient structural supervision for learning semantically rich representations. By training on thousands of diverse mask proposals spanning anatomical structures and pathological findings, MASS learns what semantically defines medical structures: the holistic combination of appearance, shape, spatial context, and anatomical relationships. We demonstrate effectiveness across data regimes: from small-scale pretraining on individual datasets (20-200 scans) to large-scale multi-modal pretraining on 5K CT, MRI, and PET volumes, all without annotations. MASS demonstrates: (i) few-shot segmentation on novel structures, (ii) performance matching full supervision with only 20-40% labeled data, while outperforming self-supervised baselines by over 20 points in Dice score in low-data regimes, and (iii) frozen-encoder classification on unseen pathologies that matches fully supervised training with thousands of samples. Mask-guided self-supervised pretraining captures broadly generalizable knowledge, opening a path toward 3D medical imaging foundation models without expert annotations. Code is available at https://github.com/Stanford-AIMI/MASS.

Paper Structure (26 sections, 8 figures, 10 tables, 1 algorithm)

Figures (8)

  • Figure 1: MASS framework overview. (A) Annotation-free mask generation: SAM2 generates class-agnostic masks by sampling 2D slices from unlabeled 3D images, applying automatic segmentation, and propagating masks through volumes. (B) Mask-guided self-supervised learning: At each training step, we sample an image $x$ and its auto-generated masks $m$. We then create two augmented views: a reference $(x_s, y_s)$ and a query $(x_q, y_q)$. The model extracts a task embedding from the reference mask and uses it to predict the corresponding region in the query view. By solving many such in-context segmentation tasks, the model learns generalizable, semantically rich representations.
  • Figure 2: Progressive scaling analysis. Diversity along anatomical and modality dimensions drives generalization.
  • Figure 3: SAM2-generated 2D masks on initial seed slices. We show examples from abdomen CT, head & neck CT, and abdomen MR. Images appear in color because we create 3-channel inputs for SAM2 using three complementary intensity windows. Each tissue's color reflects its relative intensity across the three channels. SAM2 generates meaningful region proposals with good coverage of diverse anatomical structures including organs, bones, muscles, sub-organ regions, and pathologies (cysts). However, the masks also contain substantial noise from over-segmentation, miss some objects, and struggle with small, subtle structures. Despite these imperfections, the diverse mask proposals spanning multiple granularities provide sufficient supervision for MASS to learn generalizable representations.
  • Figure 4: SAM2 3D mask propagation results. We show 3D masks generated by propagating the 2D seed masks from Figure 3 through the volume using SAM2's video prediction capability. SAM2 successfully converts 2D masks into volumetric segmentations by tracking boundaries across slices. While propagation maintains anatomical coherence for most structures, the results still contain noise from inconsistent boundaries and occasional tracking failures. These imperfect but volumetrically consistent masks provide the structural supervision needed for MASS pretraining.
  • Figure 5: Task encoding module architecture. The module extracts compact task embeddings from reference image-mask pairs through two parallel streams: foreground feature encoding (top) captures fine anatomical details via high-resolution mask application, while contextual feature encoding (bottom) uses pixel shuffle operations and learnable query tokens with cross/self-attention to extract global context. The combined task embedding guides query image segmentation through the mask decoder. We follow gao2025show and refer readers to it for more details.
  • ...and 3 more figures
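
The Figure 3 caption mentions that 3-channel inputs for SAM2 are built from three complementary intensity windows. The sketch below illustrates one way such an input could be formed for a single CT slice. It is a minimal illustration, not the authors' implementation: the function names `apply_window` and `to_three_channel` are hypothetical, and the window centers and widths (commonly used soft-tissue, lung, and bone settings in Hounsfield units) are assumptions, since the source does not state which windows are used.

```python
import numpy as np

# ASSUMPTION: illustrative (center, width) windows in Hounsfield units.
# The source only says "three complementary intensity windows".
WINDOWS = (
    (40.0, 400.0),     # soft-tissue-like window
    (-600.0, 1500.0),  # lung-like window
    (400.0, 1800.0),   # bone-like window
)


def apply_window(slice_hu, center, width):
    """Clip a 2D slice to [center - width/2, center + width/2] and scale to [0, 1]."""
    lo = center - width / 2.0
    hi = center + width / 2.0
    clipped = np.clip(slice_hu, lo, hi)
    return (clipped - lo) / (hi - lo)


def to_three_channel(slice_hu, windows=WINDOWS):
    """Stack three windowed copies of one slice into an (H, W, 3) uint8 image."""
    channels = [apply_window(slice_hu, c, w) for c, w in windows]
    stacked = np.stack(channels, axis=-1)  # channel-last, like an RGB image
    return (stacked * 255.0).round().astype(np.uint8)


if __name__ == "__main__":
    # Toy 2x2 "slice" with air, water, soft tissue, and dense bone intensities.
    demo = np.array([[-1000.0, 0.0], [40.0, 1000.0]])
    rgb = to_three_channel(demo)
    print(rgb.shape, rgb.dtype)
```

Because each channel emphasizes a different intensity range, a tissue's apparent color in the stacked image depends on where its intensity falls within each window, which is consistent with the caption's note that color reflects relative intensity across the three channels.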