Foundation Model Hidden Representations for Heart Rate Estimation from Auscultation

Jingping Nie, Dung T. Tran, Karan Thakkar, Vasudha Kowtha, Jon Huang, Carlos Avendano, Erdrin Azemi, Vikramjit Mitra

TL;DR

This work investigates whether self-supervised foundation models (FMs) trained on broad audio data encode heart sound information suitable for heart rate (HR) estimation from phonocardiograms (PCGs). A layer-wise analysis of six FMs (HuBERT, wav2vec2, wavLM, Whisper, CLAP, and an in-house CLAP) on the CirCor DigiScope PCG dataset compares their representations against a baseline that relies on acoustic features. The main finding is that FM representations are largely competitive with the baseline: the in-house CLAP audio encoder achieves the lowest mean absolute error (MAE) across splits (approximately $1.88$ bpm) and sometimes outperforms the baseline. The results support using FM-based representations for robust, transfer-friendly HR estimation from heart sounds and motivate future work on feature fusion, domain adaptation, and clinical extension.

Abstract

Auscultation, particularly of heart sounds, is a non-invasive technique that provides essential vital sign information. Recently, self-supervised acoustic representation foundation models (FMs) have been proposed to offer insights into acoustics-based vital signs. However, there has been little exploration of the extent to which auscultation is encoded in these pre-trained FM representations. In this work, using a publicly available phonocardiogram (PCG) dataset and a heart rate (HR) estimation model, we conduct a layer-wise investigation of six acoustic representation FMs: HuBERT, wav2vec2, wavLM, Whisper, Contrastive Language-Audio Pretraining (CLAP), and an in-house CLAP model. Additionally, we implement the baseline method of Nie et al. (2024), which relies on acoustic features, and show that, overall, representation vectors from the pre-trained FMs offer performance comparable to the baseline. Notably, HR estimation using representations from the audio encoder of the in-house CLAP model outperforms the baseline, achieving a lower mean absolute error (MAE) across various train/validation/test splits despite the domain mismatch.
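
As a rough illustration of the evaluation metric named above, the following sketch computes the MAE (in bpm) between reference and estimated heart rates, then picks the embedding layer with the lowest MAE averaged over data splits. The function names `mean_absolute_error` and `best_layer`, the dictionary layout, and all numbers are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Return the MAE (in bpm) between reference and estimated heart rates."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def best_layer(mae_by_layer):
    """Pick the layer whose MAE, averaged over splits, is lowest.

    mae_by_layer maps a layer index to a list of per-split MAE values
    (a hypothetical layout, chosen only for this sketch).
    """
    averaged = {layer: mean(values) for layer, values in mae_by_layer.items()}
    layer = min(averaged, key=averaged.get)
    return layer, averaged[layer]


# Made-up heart rates in bpm, for illustration only.
reference = [72.0, 88.0, 110.0, 95.0]
estimated = [70.0, 91.0, 108.0, 96.0]
mae = mean_absolute_error(reference, estimated)

# Made-up per-split MAE values for two hypothetical layers.
layer, layer_mae = best_layer({0: [3.0, 5.0], 1: [2.0, 2.0]})
print(mae, layer, layer_mae)
```

Averaging over splits before comparing layers is one simple way to summarize a layer-wise comparison; the paper's own aggregation may differ.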

Paper Structure

This paper contains 6 sections, 2 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: The overall goal of this project.
  • Figure 2: The data preparation process.
  • Figure 3: The heart rate distribution for 6 training, validation, and test splits and the number of unique subjects.
  • Figure 4: The representation vectors of the $n^{\text{th}}$ embedding layer in the audio encoder of the foundation model are passed into a downstream 2D convolutional neural network (2dCNN) for HR estimation.
  • Figure 5: The Mean Absolute Error ($MAE_{i,j}$) across six data splits for different foundation models (FMs) at various embedding layers.
  • ...and 2 more figures