Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation

Divyanshu Mishra, Mohammadreza Salehi, Pramit Saha, Olga Patey, Aris T. Papageorghiou, Yuki M. Asano, J. Alison Noble

TL;DR

The paper tackles SSL for echocardiography, where high frame similarity and subtle pathologies hinder traditional self-supervision. It introduces DISCOVR, a dual-branch framework that jointly models temporal dynamics via video self-distillation and leverages online spatial guidance from an evolving image encoder, bridged by a semantic cluster distillation loss $\mathcal{L}_{\text{SCD}}$ to fuse spatial semantics into temporal representations. Across six datasets covering fetal, pediatric, and adult populations, DISCOVR achieves state-of-the-art performance in anomaly detection, zero-shot and linear probing classification, segmentation, and LVEF prediction, without labeled anomalies or pretrained models. The results demonstrate robust generalization, strong clinical relevance, and potential to scale echocardiography analysis with minimal labeling and augmentation requirements.

Abstract

Self-supervised learning (SSL) has achieved major advances in natural images and video understanding, but challenges remain in domains like echocardiography (heart ultrasound) due to subtle anatomical structures, complex temporal dynamics, and the current lack of domain-specific pre-trained models. Existing SSL approaches such as contrastive, masked modeling, and clustering-based methods struggle with high intersample similarity, sensitivity to low PSNR inputs common in ultrasound, or aggressive augmentations that distort clinically relevant features. We present DISCOVR (Distilled Image Supervision for Cross Modal Video Representation), a self-supervised dual-branch framework for cardiac ultrasound video representation learning. DISCOVR combines a clustering-based video encoder that models temporal dynamics with an online image encoder that extracts fine-grained spatial semantics. These branches are connected through a semantic cluster distillation loss that transfers anatomical knowledge from the evolving image encoder to the video encoder, enabling temporally coherent representations enriched with fine-grained semantic understanding. Evaluated on six echocardiography datasets spanning fetal, pediatric, and adult populations, DISCOVR outperforms both specialized video anomaly detection methods and state-of-the-art video-SSL baselines in zero-shot and linear probing setups, achieving superior segmentation transfer and strong downstream performance on clinically relevant tasks such as LVEF prediction. Code available at: https://github.com/mdivyanshu97/DISCOVR
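
To make the idea of distilling cluster assignments from the image branch into the video branch more concrete, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that each token is scored against a shared set of cluster prototypes and that the distillation takes the form of a cross-entropy between the image branch's soft assignments (the target) and the video branch's soft assignments (the prediction). The function names, the temperatures, and the cross-entropy form are all hypothetical; the exact formulation of the loss is not given in this section.

```python
import math


def softmax(logits, temperature):
    """Numerically stable softmax over one token's cluster logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cluster_distillation_loss(image_logits, video_logits,
                              teacher_temp=0.05, student_temp=0.1):
    """Hypothetical cluster distillation loss (illustrative assumption).

    image_logits, video_logits: one list of per-cluster scores per token.
    The image-branch assignment is treated as a fixed target; the
    video-branch assignment is the prediction. Returns the mean
    cross-entropy over tokens.
    """
    assert len(image_logits) == len(video_logits)
    total = 0.0
    for img, vid in zip(image_logits, video_logits):
        target = softmax(img, teacher_temp)  # sharper target (assumed temperature)
        pred = softmax(vid, student_temp)    # softer prediction (assumed temperature)
        total += -sum(t * math.log(p + 1e-12) for t, p in zip(target, pred))
    return total / len(image_logits)


# One token, three clusters: the image branch prefers cluster 0.
image = [[2.0, 0.0, 0.0]]
aligned = cluster_distillation_loss(image, [[2.0, 0.0, 0.0]])
mismatched = cluster_distillation_loss(image, [[0.0, 2.0, 0.0]])
print(aligned < mismatched)
```

Under these assumptions the loss is small when the video branch assigns a token to the same cluster as the image branch and large when it does not, which is the behaviour one would expect from a loss meant to transfer spatial semantics across branches.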

Paper Structure

This paper contains 22 sections, 9 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Figure (left) compares two fine-grained videos: a natural scene of a person baking (left) and an adult fetal heart ultrasound (right). The frame-level cosine similarity matrix, computed using a pretrained VideoMAE model, shows that ultrasound frames are highly similar (mean=0.99), with only minor local variations. This highlights the difficulty in distinguishing individual frames in such medical videos. Figure (right) compares normal and abnormal adult echocardiograms that appear nearly identical. However, on close inspection, it is revealed that the abnormal heart shows severe biventricular systolic dysfunction and a dilated, globular left ventricle, underscoring the subtlety of cardiac defects and the need for fine-grained structural analysis.
  • Figure 1: Comparison of video anomaly detection methods on three echocardiography datasets. Our method consistently outperforms SOTA approaches, demonstrating improved effectiveness in identifying cardiac abnormalities across diverse patient populations.
  • Figure 2: Overview of the DISCOVR framework. An input video is tokenized into 3D patches for the video branch and per-frame 2D patches for the image branch. Both encoders perform masked self-distillation. Masked video tokens are reconstructed by the video decoder, and dense semantic features are extracted from the image encoder. The $\mathcal{L}_{\text{SCD}}$ loss then aligns these outputs, distilling fine-grained spatial semantics into the video representation to produce rich spatio-temporal features.
  • Figure 3: Zero-shot classification comparison: (Top) The sweep from four-chamber to three-vessel view reveals smaller left-sided structures (LV and Ao) versus right-sided (RV and PA), consistent with coarctation of the aorta. (Middle) DISCOVR correctly identifies the abnormality, focusing on the ventricles in the four-chamber view and the Ao and PA in the vessel view. (Bottom) A backbone pretrained with MVD, in contrast, misclassifies the video as normal.
  • Figure 4: Barplot comparing the segmentation performance across different models. Our proposed DISCOVR approach achieves the highest Dice score of 0.844, outperforming both specialized segmentation architectures (DeepLab-V3, UNET) and other self-supervised methods.
  • ...and 7 more figures
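
Two quantities cited in the captions above, the frame-level cosine similarity matrix and the Dice score, can be sketched as follows. This is an illustrative sketch using only the Python standard library. The function names are hypothetical, the per-frame feature vectors are assumed to come from some pretrained encoder (the captions mention VideoMAE, which is not loaded here), and averaging only the off-diagonal entries is an assumption about how the reported mean was computed.

```python
import math


def cosine_similarity_matrix(frame_features):
    """Pairwise cosine similarity between per-frame feature vectors.

    frame_features: list of equal-length, non-zero vectors, one per frame.
    """
    norms = [math.sqrt(sum(x * x for x in f)) for f in frame_features]
    n = len(frame_features)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(frame_features[i], frame_features[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


def mean_off_diagonal(sim):
    """Mean similarity between distinct frames (diagonal excluded by assumption)."""
    n = len(sim)
    values = [sim[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(values) / len(values)


def dice_score(pred_mask, true_mask):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks."""
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    denom = sum(pred_mask) + sum(true_mask)
    return 1.0 if denom == 0 else 2.0 * intersection / denom


# Toy features: three nearly identical "frames".
features = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.10]]
sim = cosine_similarity_matrix(features)
print(round(mean_off_diagonal(sim), 3))
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```

A mean off-diagonal similarity close to 1 means that the encoder barely separates individual frames, which is the difficulty Figure 1 points to for ultrasound video.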