Pretext Matters: An Empirical Study of SSL Methods in Medical Imaging

Vedrana Ivezić, Mara Pleasure, Ashwath Radhachandran, Saarang Panchavati, Shreeram Athreya, Vivek Sant, Benjamin Emert, Gregory Fishbein, Corey Arnold, William Speier

Abstract

Although self-supervised learning (SSL) has demonstrated a strong ability to learn robust representations from unlabeled data, the choice of SSL strategy can lead to vastly different performance outcomes in specialized domains. Joint embedding architectures (JEAs) and joint embedding predictive architectures (JEPAs) have shown greater robustness to noise and stronger semantic feature learning than pixel reconstruction-based SSL methods, leading to their widespread adoption in medical imaging. However, no prior work has systematically investigated which SSL objective is better aligned with the spatial organization of clinically relevant signal. In this work, we empirically investigate how the choice of SSL method affects the learned representations in medical imaging. We select two representative imaging modalities characterized by distinct noise profiles: ultrasound and histopathology. When informative signal is spatially localized, as in histopathology, JEAs are more effective due to their view-invariance objective. In contrast, when diagnostically relevant information is globally structured, such as the macroscopic anatomy present in liver ultrasounds, JEPAs perform best. These differences are especially evident in the clinical relevance of the learned features, as independently validated by board-certified radiologists and pathologists. Together, our results provide a framework for matching SSL objectives to the structural and noise properties of medical imaging modalities.

Paper Structure (42 sections, 13 figures, 4 tables)

Figures (13)

  • Figure 1: SSL pretext tasks on a lung image for (a) MAE, (b) I-JEPA, and (c) DINOv3.
  • Figure 2: Example visualizations including attention maps, cosine similarity maps (anchor points denoted by red X), and PCA maps for both ultrasound and pathology.
  • Figure 3: Attention heatmaps from each attention head of the final ViT layer across models pre-trained on ultrasounds. The dark oval area in the GBCU image is the gallbladder with the liver region to the left. The oval area of the Fatty Liver image is the kidney, the white line above is the diaphragm, and the shaded region to the right is the liver. The TN5000 image contains a thyroid nodule marked with calipers.
  • Figure 4: Cosine similarity maps computed in relation to anchor patches (red X).
  • Figure 5: Attention heatmaps from each attention head of the final ViT layer across models pre-trained on histopathology data for three downstream datasets.
  • ...and 8 more figures
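
The cosine similarity maps mentioned in the figure captions (similarity of every patch to an anchor patch) can be illustrated with a minimal sketch. This is not the authors' code: the function name `cosine_similarity_map`, the 14×14 patch grid, the 384-dimensional embeddings, and the random stand-in features are all assumptions made for illustration. In practice the embeddings would come from the patch tokens of a pre-trained ViT.

```python
import numpy as np


def cosine_similarity_map(patch_embeddings, grid_shape, anchor_rc):
    """Cosine similarity between every patch embedding and one anchor patch.

    patch_embeddings: array of shape (H * W, D), patches in row-major order.
    grid_shape: (H, W) layout of the patch grid.
    anchor_rc: (row, col) of the anchor patch in that grid.
    Returns an (H, W) array with values in [-1, 1].
    """
    h, w = grid_shape
    emb = np.asarray(patch_embeddings, dtype=np.float64)
    if emb.ndim != 2 or emb.shape[0] != h * w:
        raise ValueError("expected embeddings of shape (H * W, D)")

    # L2-normalize each patch embedding; the clip guards against zero vectors.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)

    # The dot product of unit vectors is the cosine similarity.
    r, c = anchor_rc
    anchor = unit[r * w + c]
    return (unit @ anchor).reshape(h, w)


# Hypothetical stand-in for ViT patch tokens: 14 x 14 patches, 384 dimensions.
rng = np.random.default_rng(0)
grid = (14, 14)
embeddings = rng.normal(size=(grid[0] * grid[1], 384))
sim_map = cosine_similarity_map(embeddings, grid, anchor_rc=(7, 7))
```

The resulting `sim_map` can be upsampled to the image resolution and overlaid as a heatmap. The anchor location itself always has similarity 1, which is a quick sanity check on the indexing.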