SS-DPPN: A self-supervised dual-path foundation model for the generalizable cardiac audio representation

Ummy Maria Muna, Md Mehedi Hasan Shawon, Md Jobayer, Sumaiya Akter, Md Rakibul Hasan, Md. Golam Rabiul Alam

TL;DR

SS-DPPN is a self-supervised, dual-path foundation model for cardiac audio that learns jointly from raw 1D waveforms and 2D mel-spectrograms. It is trained with a hybrid loss that combines instance-level contrastive learning with global distribution alignment via the Wasserstein distance, and it uses a prototypical network for robust downstream classification on imbalanced data. The approach achieves state-of-the-art results on four heart-sound benchmarks, remains data-efficient under a threefold reduction in labeled data, and generalizes to cross-domain tasks such as lung-sound classification and heart-rate estimation. These findings position SS-DPPN as a robust, scalable foundation model for physiological signals with strong calibration and transferability, addressing annotation bottlenecks in medical AI.

Abstract

The automated analysis of phonocardiograms is vital for the early diagnosis of cardiovascular disease, yet supervised deep learning is often constrained by the scarcity of expert-annotated data. In this paper, we propose the Self-Supervised Dual-Path Prototypical Network (SS-DPPN), a foundation model that learns cardiac audio representations from unlabeled data for downstream classification. The framework introduces a dual-path, contrastive-learning-based architecture that simultaneously processes 1D waveforms and 2D spectrograms using a novel hybrid loss. For the downstream task, we use a metric-learning approach based on a Prototypical Network, which enhances sensitivity and produces well-calibrated, trustworthy predictions. SS-DPPN achieves state-of-the-art performance on four cardiac audio benchmarks. The framework also demonstrates exceptional data efficiency, reaching performance comparable to a fully supervised model with a three-fold reduction in labeled data. Finally, the learned representations generalize successfully to lung sound classification and heart rate estimation. Our experiments and findings validate SS-DPPN as a robust, reliable, and scalable foundation model for physiological signals.
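To make the idea of a hybrid loss concrete, here is a minimal NumPy sketch that adds an instance-level contrastive term to a distribution-level Wasserstein term. It is an illustration under stated assumptions, not the authors' formulation: the abstract does not specify how the two terms are defined or weighted. The sketch assumes the contrastive term is an InfoNCE-style loss that pairs the waveform and spectrogram embeddings of the same recording, and that the Wasserstein term is the per-dimension 1D Wasserstein-1 distance between the two batches. The function names, the weight `lam`, and the `temperature` value are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def cross_path_contrastive(z_wave, z_spec, temperature=0.1):
    """InfoNCE-style loss (assumed form): row i of z_wave should be most
    similar to row i of z_spec, and vice versa."""
    z_wave, z_spec = l2_normalize(z_wave), l2_normalize(z_spec)
    logits = z_wave @ z_spec.T / temperature  # shape (N, N)

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row; targets are the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

def per_dim_wasserstein1(z_a, z_b):
    """Average, over embedding dimensions, of the 1D Wasserstein-1 distance
    between two equal-size batches (assumed stand-in for the paper's term).
    For equal-size samples this is the mean absolute difference of sorted values."""
    return float(np.mean(np.abs(np.sort(z_a, axis=0) - np.sort(z_b, axis=0))))

def hybrid_loss(z_wave, z_spec, lam=0.5, temperature=0.1):
    """Contrastive term plus lam times the distribution-alignment term.
    Both lam and temperature are hypothetical values chosen for illustration."""
    contrastive = cross_path_contrastive(z_wave, z_spec, temperature)
    alignment = per_dim_wasserstein1(z_wave, z_spec)
    return contrastive + lam * alignment
```

The two terms respond to different things. The contrastive term depends on which waveform embedding is paired with which spectrogram embedding, so shuffling the rows of one batch raises it. The Wasserstein term compares only the two batch distributions, so the same shuffle leaves it unchanged.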

Paper Structure

This paper contains 22 sections, 12 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Graphical abstract of SS-DPPN, a dual-path self-supervised framework for phonocardiograms (PCGs). Unlabeled raw PCG signals are augmented and processed by two complementary encoders, a 1D encoder for waveforms and a 2D encoder for spectrograms, which are trained with a hybrid loss and then fine-tuned using a prototypical network. The prototypical network matches test embeddings to class prototypes for classification. The approach boosts robustness and generalizability across ages, devices, and recording sites, reduces dependence on labels and feature engineering, and enables early, accurate, and efficient heart-sound screening.
  • Figure 2: Calibration analysis: (a) performance comparison between SS-DPPN and the baseline; (b) bootstrap distribution of AUPRC values across runs; (c) Brier score and Expected Calibration Error of SS-DPPN and the baseline; (d) calibration curve comparison (closer to the diagonal is better); (e) ROC curve comparison; (f) precision-recall (AUPR) curve comparison.
  • Figure 3: (a) t-SNE plot visualizing feature separation on lung sound data (i) and predicted vs. true class plot for heart rate estimation (ii). (b) Comprehensive data efficiency analysis against a supervised baseline, showing the learning curves with confidence intervals (i), the trend of relative SSL improvement (ii), and the performance efficiency ratio (iii).
  • Figure 4: t-SNE and UMAP feature separation visualizations. To show the robustness and generalizability of SS-DPPN, visualizations for the four datasets are shown in panels a, b, c, and d, respectively.
  • Figure 5: Detailed model architecture. The proposed SS-DPPN model for heart sound classification has two major modules: one for self-supervised pretraining using dual-path contrastive learning, and a prototypical network that computes class-label probabilities.
  • ...and 12 more figures
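
The prototype-matching step mentioned in the captions for Figures 1 and 5 can be sketched as follows. This is a minimal NumPy illustration, assuming the standard Prototypical Network recipe: each class prototype is the mean of that class's labeled embeddings, and class probabilities are a softmax over negative squared Euclidean distances. Whether SS-DPPN uses exactly this distance is an assumption, and the function names are hypothetical.

```python
import numpy as np

def class_prototypes(support_emb, support_labels):
    """Return the sorted class ids and one prototype per class
    (the mean embedding of that class's labeled support examples)."""
    classes = np.unique(support_labels)
    protos = np.stack([support_emb[support_labels == c].mean(axis=0) for c in classes])
    return classes, protos

def prototype_probabilities(query_emb, protos):
    """Softmax over negative squared Euclidean distances to each prototype
    (assumed distance; shape of the result is (num_queries, num_classes))."""
    diff = query_emb[:, None, :] - protos[None, :, :]
    logits = -(diff ** 2).sum(axis=-1)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def predict(query_emb, support_emb, support_labels):
    """Assign each query embedding to the class of its nearest prototype."""
    classes, protos = class_prototypes(support_emb, support_labels)
    probs = prototype_probabilities(query_emb, protos)
    return classes[probs.argmax(axis=1)], probs

# Toy usage with 2D embeddings standing in for encoder outputs.
support = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
query = np.array([[0.0, 0.4], [5.0, 5.4]])
predicted_classes, class_probs = predict(query, support, labels)
```

Because each prototype is a per-class mean, a class with few labeled examples still gets exactly one prototype, which is one reason this kind of classifier is often used when classes are imbalanced.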