Foundation Model Hidden Representations for Heart Rate Estimation from Auscultation
Jingping Nie, Dung T. Tran, Karan Thakkar, Vasudha Kowtha, Jon Huang, Carlos Avendano, Erdrin Azemi, Vikramjit Mitra
TL;DR
This work investigates whether self-supervised foundation models (FMs) trained on broad audio data encode heart sound information suitable for heart rate (HR) estimation from phonocardiograms (PCGs). Six FMs (HuBERT, wav2vec2, wavLM, Whisper, CLAP, and an in-house CLAP) are analyzed layer by layer on the CirCor DigiScope PCG dataset and compared against a baseline acoustic-feature method. The main finding is that FM representations are largely competitive with the baseline, with the in-house CLAP audio encoder achieving the lowest mean absolute error (MAE) across splits (approximately $1.88$ bpm) and sometimes outperforming the baseline. The results support using FM-based representations for robust, transfer-friendly HR estimation from heart sounds and motivate future work on feature fusion, domain adaptation, and clinical extension.
Abstract
Auscultation, particularly of heart sounds, is a non-invasive technique that provides essential vital sign information. Recently, self-supervised acoustic representation foundation models (FMs) have been proposed to offer insights into acoustics-based vital signs. However, there has been little exploration of the extent to which auscultation is encoded in these pre-trained FM representations. In this work, using a publicly available phonocardiogram (PCG) dataset and a heart rate (HR) estimation model, we conduct a layer-wise investigation of six acoustic representation FMs: HuBERT, wav2vec2, wavLM, Whisper, Contrastive Language-Audio Pretraining (CLAP), and an in-house CLAP model. Additionally, we implement the baseline method from Nie et al. (2024), which relies on acoustic features, and show that, overall, representation vectors from pre-trained FMs offer performance comparable to the baseline. Notably, HR estimation using representations from the audio encoder of the in-house CLAP model outperforms the baseline, achieving a lower mean absolute error (MAE) across various train/validation/test splits despite the domain mismatch.
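As a rough illustration of what a layer-wise investigation scored by MAE can look like, the following minimal sketch fits one regression probe per layer and reports the test MAE in bpm. Everything in it is an assumption made for illustration: the helper names `mean_absolute_error` and `probe_layers`, the closed-form ridge probe (which stands in for the HR estimation model, not reproduced here), and the synthetic "layer" features, which are random data rather than real FM hidden states or PCG recordings.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Mean absolute error, in the unit of the targets (here, bpm)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def probe_layers(layer_feats, hr, n_train, ridge=1e-3):
    """Fit one ridge-regression probe per layer and return its test MAE.

    layer_feats: list of (n_samples, dim) arrays, one per (hypothetical) layer.
    hr:          (n_samples,) array of reference heart rates in bpm.
    """
    maes = []
    for feats in layer_feats:
        # Append a constant column so the probe can learn an offset.
        X = np.hstack([feats, np.ones((feats.shape[0], 1))])
        X_tr, X_te = X[:n_train], X[n_train:]
        y_tr, y_te = hr[:n_train], hr[n_train:]
        # Closed-form ridge solution: (X^T X + lambda I) w = X^T y.
        A = X_tr.T @ X_tr + ridge * np.eye(X.shape[1])
        w = np.linalg.solve(A, X_tr.T @ y_tr)
        maes.append(mean_absolute_error(y_te, X_te @ w))
    return maes

# Synthetic stand-in data: three fake "layers" whose first feature carries
# the heart rate with progressively less noise (an assumption, not a finding).
rng = np.random.default_rng(0)
n_samples, dim = 200, 8
hr = rng.uniform(60.0, 160.0, size=n_samples)
layer_feats = []
for noise_std in (50.0, 10.0, 1.0):
    feats = rng.normal(size=(n_samples, dim))
    feats[:, 0] = hr + rng.normal(scale=noise_std, size=n_samples)
    layer_feats.append(feats)

maes = probe_layers(layer_feats, hr, n_train=150)
best_layer = int(np.argmin(maes))
print("per-layer MAE (bpm):", [round(m, 2) for m in maes])
print("best layer index:", best_layer)
```

With real models, each entry of `layer_feats` would instead hold pooled hidden states from one encoder layer, and the split would follow the dataset's train/validation/test partition rather than a simple index cut.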
