Leveraging Self-Supervised Learning for Fetal Cardiac Planes Classification using Ultrasound Scan Videos
Joseph Geo Benjamin, Mothilal Asokan, Amna Alhosani, Hussain Alasmawi, Werner Gerhard Diehl, Leanne Bricker, Karthik Nandakumar, Mohammad Yaqub
TL;DR
This paper assesses how self-supervised learning (SSL) on unlabelled fetal ultrasound (US) videos can improve downstream Standard Fetal Cardiac Planes (SFCP) classification when labelled data is scarce. It benchmarks seven dual-encoder SSL methods (spanning reconstruction, contrastive, distillation, and information-theoretic objectives) that pretrain a ResNet-50 backbone on US videos, then fine-tunes or linearly probes the backbone on limited 2D SFCP images. The study finds that dataset variance drives generalisation more than sheer size, and that BarlowTwins provides the most robust transfer: under full fine-tuning with only $1\%$ of the labelled data, it gains $12\%$ in F1-score over ImageNet initialisation and at least $4\%$ over other SSL initialisations. The findings suggest a practical way to use unlabelled US video to improve fetal cardiac plane classification without large annotated datasets, which matters for clinical deployment, and they underscore the value of SSL methods that decorrelate representations in medical video domains.
Abstract
Self-supervised learning (SSL) methods are popular because they can address situations with limited annotated data by directly utilising the underlying data distribution. However, their adoption remains under-explored in ultrasound (US) imaging, especially for fetal assessment. We investigate the potential of dual-encoder SSL in utilising unlabelled US video data to improve the performance of the challenging downstream task of Standard Fetal Cardiac Planes (SFCP) classification using limited labelled 2D US images. We study 7 SSL approaches based on reconstruction, contrastive loss, distillation, and information theory, and evaluate them extensively on a large private US dataset. Our observations and findings are consolidated from more than 500 downstream training experiments under different settings. Our primary observation is that, for SSL training, the variance of the dataset is more crucial than its size because it allows the model to learn generalisable representations, which improve the performance of downstream tasks. Overall, the BarlowTwins method shows robust performance, irrespective of the training settings and data variations, when used as an initialisation for downstream tasks. Notably, under full fine-tuning with 1% of labelled data, BarlowTwins initialisation outperforms ImageNet initialisation by 12% in F1-score and other SSL initialisations by at least 4% in F1-score, making it a promising candidate for transfer learning from US video to image data.
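To make the "information theory" family concrete, the following is a minimal, dependency-free sketch of the standard BarlowTwins redundancy-reduction objective, which pushes the cross-correlation matrix between the embeddings of two augmented views towards the identity. It is an illustration only, not the implementation used in this study: the function names `standardise` and `barlow_twins_loss`, the list-of-lists input format, and the off-diagonal weight `lam=5e-3` are assumptions made for the example, and the weight used in the experiments is not stated here.

```python
import math


def standardise(column):
    """Scale one embedding dimension to zero mean and unit variance over the batch."""
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n
    std = math.sqrt(var) + 1e-12  # small constant avoids division by zero
    return [(v - mean) / std for v in column]


def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Redundancy-reduction loss for two batches of embeddings.

    z_a, z_b: lists of N embeddings (each a list of D floats) produced by the
    two encoders from two augmented views of the same N inputs.
    lam: weight of the off-diagonal term (assumed value, for illustration).
    """
    n, d = len(z_a), len(z_a[0])
    # Standardise every embedding dimension across the batch.
    cols_a = [standardise([row[k] for row in z_a]) for k in range(d)]
    cols_b = [standardise([row[k] for row in z_b]) for k in range(d)]
    # D x D cross-correlation matrix between the two views.
    c = [
        [sum(a * b for a, b in zip(cols_a[i], cols_b[j])) / n for j in range(d)]
        for i in range(d)
    ]
    # Invariance term: diagonal entries should equal 1.
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    # Redundancy term: off-diagonal entries should equal 0.
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + lam * off_diag


# Identical views whose two dimensions are uncorrelated: loss is close to zero.
views = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss_ideal = barlow_twins_loss(views, views)

# Identical views whose two dimensions are copies of each other: only the
# redundancy (off-diagonal) term is penalised.
redundant = [[1.0, 1.0], [-1.0, -1.0]]
loss_redundant = barlow_twins_loss(redundant, redundant)
```

In this toy example the first case satisfies both terms, while the second is penalised only through the off-diagonal term, which is the decorrelation pressure that distinguishes this objective from purely contrastive ones. In practice the embeddings would come from the pretrained backbone and the loss would be computed with a tensor library rather than Python lists.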
