US-JEPA: A Joint Embedding Predictive Architecture for Medical Ultrasound

Ashwath Radhachandran; Vedrana Ivezić; Shreeram Athreya; Ronit Anilkumar; Corey W. Arnold; William Speier

TL;DR

This work proposes US-JEPA, a self-supervised framework that adopts the Static-teacher Asymmetric Latent Training (SALT) objective and demonstrates that masked latent prediction provides a stable and efficient path toward robust ultrasound representations.

Abstract

Ultrasound (US) imaging poses unique challenges for representation learning due to its inherently noisy acquisition process. The low signal-to-noise ratio and stochastic speckle patterns hinder standard self-supervised learning methods relying on a pixel-level reconstruction objective. Joint-Embedding Predictive Architectures (JEPAs) address this drawback by predicting masked latent representations rather than raw pixels. However, standard approaches depend on hyperparameter-brittle and computationally expensive online teachers updated via exponential moving average. We propose US-JEPA, a self-supervised framework that adopts the Static-teacher Asymmetric Latent Training (SALT) objective. By using a frozen, domain-specific teacher to provide stable latent targets, US-JEPA decouples student-teacher optimization and pushes the student to expand upon the semantic priors of the teacher. In addition, we provide the first rigorous comparison of all publicly available state-of-the-art ultrasound foundation models on UltraBench, a public dataset benchmark spanning multiple organs and pathological conditions. Under linear probing for diverse classification tasks, US-JEPA achieves performance competitive with or superior to domain-specific and universal vision foundation model baselines. Our results demonstrate that masked latent prediction provides a stable and efficient path toward robust ultrasound representations.
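The idea of masked latent prediction against a frozen teacher can be illustrated with a toy sketch. This is not the authors' implementation: the linear-plus-tanh "encoders", the mean-pooled context, the additive positional queries, the squared-error loss, and every name below (`masked_latent_loss`, `teacher_w`, `student_w`, `predictor_w`, `pos`) are assumptions chosen only to keep the example small and self-contained.

```python
import numpy as np

def masked_latent_loss(patches, pos, mask, teacher_w, student_w, predictor_w):
    """Toy masked latent prediction loss with a frozen teacher (illustrative only).

    patches:      (N, P) flattened image patches
    pos:          (N, D) positional embeddings, one per patch (hypothetical)
    mask:         (N,) boolean, True where a patch is hidden from the student
    teacher_w:    (P, D) frozen teacher weights, never updated
    student_w:    (P, D) student weights (trainable in a real setup)
    predictor_w:  (D, D) predictor weights (trainable in a real setup)
    """
    # The frozen teacher encodes every patch; its outputs are fixed targets.
    targets = np.tanh(patches @ teacher_w)
    # The student only sees the visible (unmasked) patches.
    visible = np.tanh(patches[~mask] @ student_w)
    # Assumption: summarize the visible context by mean pooling.
    context = visible.mean(axis=0)
    # One query per masked position: context plus that position's embedding.
    queries = context[None, :] + pos[mask]
    preds = queries @ predictor_w
    # Assumption: squared error in latent space, not in pixel space.
    return float(np.mean((preds - targets[mask]) ** 2))

rng = np.random.default_rng(0)
n_patches, patch_dim, latent_dim = 16, 32, 8
patches = rng.normal(size=(n_patches, patch_dim))
pos = rng.normal(size=(n_patches, latent_dim))
mask = np.zeros(n_patches, dtype=bool)
mask[rng.choice(n_patches, size=6, replace=False)] = True

teacher_w = 0.1 * rng.normal(size=(patch_dim, latent_dim))
student_w = 0.1 * rng.normal(size=(patch_dim, latent_dim))
predictor_w = 0.1 * rng.normal(size=(latent_dim, latent_dim))

loss = masked_latent_loss(patches, pos, mask, teacher_w, student_w, predictor_w)
print(f"toy latent prediction loss: {loss:.4f}")
```

In a real training loop only `student_w` and `predictor_w` would receive gradient updates. Because the teacher is frozen, its targets do not drift during training, which is the contrast the abstract draws with teachers updated via exponential moving average.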

Paper Structure

This paper contains 36 sections, 9 equations, 8 figures, and 5 tables.

Introduction
Ultrasound Joint Embedding Predictive Architecture (US-JEPA)
Standardizing Evaluation: UltraBench
Related Work
Universal Vision Foundation Models
Task-Specific Ultrasound Foundation Models
General Ultrasound Foundation Models
Preliminaries
I-JEPA
Static-teacher Asymmetric Latent Training (SALT)
US-JEPA
Self-Distillation via SALT
Ultrasound Region-Conditioning (USrc)
Training Pipeline
US-JEPA Architecture
...and 21 more sections

Figures (8)

Figure 1: USrc-JEPA framework. Here we show the model training framework with USrc. URFM is the frozen teacher that extracts target embeddings. The student and predictor are jointly optimized with $\mathcal{L}_{\text{US-JEPA}}$ to align with the target.
Figure 2: Distribution of pretraining data. To characterize the dataset composition at the organ level, we report the distribution of a. temporal sequences, including videos and volumes ($n_v$), and b. individual static frames ($n_f$).
Figure 3: Results for few-shot scaling. We report the mean macro F1 score (%) over five seeds for each model's probe to measure performance with 1% to 100% of training labels. Note: each dataset is plotted on a unique y-axis scale to better highlight model-specific performance trends.
Figure 4: Robustness to domain-specific corruption. Results show the mean macro F1 across five seeds for each model-dataset-corruption permutation. Linear probes were trained on full, uncorrupted training sets and evaluated on increasingly corrupted test sets to assess structural representation stability.
Figure 6:
...and 3 more figures
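
The figure captions above report mean macro F1 across five seeds. As a reminder of what that metric computes, here is a minimal sketch; the function name `macro_f1` and the toy labels are illustrative assumptions and are not taken from the paper or from UltraBench.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in labels or predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in np.union1d(y_true, y_pred):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0 by convention.
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Toy example: synthetic labels and one prediction vector per "seed".
y_true = [0, 0, 1, 1]
per_seed_preds = [[0, 1, 1, 1], [0, 0, 1, 1]]
per_seed_f1 = [macro_f1(y_true, p) for p in per_seed_preds]
mean_f1 = float(np.mean(per_seed_f1))
print(per_seed_f1, mean_f1)
```

Macro averaging weights every class equally regardless of its frequency, which matters when the classes in a dataset are imbalanced.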
