US-JEPA: A Joint Embedding Predictive Architecture for Medical Ultrasound

Ashwath Radhachandran, Vedrana Ivezić, Shreeram Athreya, Ronit Anilkumar, Corey W. Arnold, William Speier

TL;DR

This work proposes US-JEPA, a self-supervised framework that adopts the Static-teacher Asymmetric Latent Training (SALT) objective and demonstrates that masked latent prediction provides a stable and efficient path toward robust ultrasound representations.

Abstract

Ultrasound (US) imaging poses unique challenges for representation learning due to its inherently noisy acquisition process. The low signal-to-noise ratio and stochastic speckle patterns hinder standard self-supervised learning methods relying on a pixel-level reconstruction objective. Joint-Embedding Predictive Architectures (JEPAs) address this drawback by predicting masked latent representations rather than raw pixels. However, standard approaches depend on hyperparameter-brittle and computationally expensive online teachers updated via exponential moving average. We propose US-JEPA, a self-supervised framework that adopts the Static-teacher Asymmetric Latent Training (SALT) objective. By using a frozen, domain-specific teacher to provide stable latent targets, US-JEPA decouples student-teacher optimization and pushes the student to expand upon the semantic priors of the teacher. In addition, we provide the first rigorous comparison of all publicly available state-of-the-art ultrasound foundation models on UltraBench, a public dataset benchmark spanning multiple organs and pathological conditions. Under linear probing for diverse classification tasks, US-JEPA achieves performance competitive with or superior to domain-specific and universal vision foundation model baselines. Our results demonstrate that masked latent prediction provides a stable and efficient path toward robust ultrasound representations.
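
The static-teacher idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear "encoders", the mean-pooled context, the L1 distance, and every name below (`masked_latent_loss`, `teacher_w`, `student_w`, `predictor_w`) are assumptions chosen only to show the shape of the objective. A frozen teacher embeds the masked patches to produce targets, the student embeds only the visible patches, and a predictor maps the student's context to the teacher's latent space.

```python
import random


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Hypothetical helper: a small random weight matrix."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def masked_latent_loss(patches, mask, teacher_w, student_w, predictor_w):
    """Mean L1 distance between predicted and teacher latents on masked patches.

    patches: list of patch vectors; mask[i] is True when patch i is hidden
    from the student. The teacher weights are only read, never updated,
    which is what "frozen" means in this sketch.
    """
    visible = [linear(student_w, p) for p, m in zip(patches, mask) if not m]
    masked = [p for p, m in zip(patches, mask) if m]
    if not visible or not masked:
        raise ValueError("need at least one visible and one masked patch")

    # Toy context: average the student embeddings of the visible patches.
    dim = len(visible[0])
    context = [sum(v[d] for v in visible) / len(visible) for d in range(dim)]

    # Toy predictor: one prediction shared by all masked positions
    # (a real predictor would condition on each masked position).
    prediction = linear(predictor_w, context)

    total = 0.0
    for p in masked:
        target = linear(teacher_w, p)  # latent target from the frozen teacher
        total += sum(abs(a - b) for a, b in zip(prediction, target)) / dim
    return total / len(masked)


rng = random.Random(0)
patch_dim, latent_dim = 4, 3
teacher_w = random_matrix(latent_dim, patch_dim, rng)      # frozen
student_w = random_matrix(latent_dim, patch_dim, rng)      # trainable
predictor_w = random_matrix(latent_dim, latent_dim, rng)   # trainable
patches = [[rng.gauss(0.0, 1.0) for _ in range(patch_dim)] for _ in range(6)]
mask = [False, True, False, True, True, False]

print(masked_latent_loss(patches, mask, teacher_w, student_w, predictor_w))
```

In a training loop under these assumptions, only `student_w` and `predictor_w` would receive gradient updates; because the targets come from fixed weights, no exponential-moving-average schedule needs to be tuned.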

Paper Structure (36 sections, 9 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: USrc-JEPA framework. Here we show the model training framework with USrc. URFM is the frozen teacher that extracts target embeddings. The student and predictor are jointly optimized with $\mathcal{L}_{US-JEPA}$ to align with the target.
  • Figure 2: Distribution of pretraining data. To characterize the dataset composition at the organ level, we report the distribution of a. temporal sequences, including videos and volumes ($n_v$), and b. individual static frames ($n_f$).
  • Figure 3: Results for few-shot scaling. We report the mean macro F1 score % over five seeds for each model's probe to measure performance with 1% to 100% of training labels. Note: each dataset is plotted on a unique y-axis scale to better highlight model-specific performance trends.
  • Figure 4: Robustness to domain-specific corruption. Results show the mean macro F1 across five seeds for each model-dataset-corruption permutation. Linear probes were trained on full, uncorrupted training sets and evaluated on increasingly corrupted test sets to assess structural representation stability.
  • Figure 6:
  • ...and 3 more figures