Predicting Heart Activity from Speech using Data-driven and Knowledge-based features

Gasser Elbanna, Zohreh Mostaani, Mathew Magimai.-Doss

TL;DR

The paper addresses predicting heart activity parameters, specifically BPM and HRV, from speech and investigates how data-driven, self-supervised speech representations compare to traditional acoustic features. Using the Ulm-TSST dataset, it evaluates Hybrid BYOL-S against openSMILE features (eGeMAPS, ComParE) across speaker-independent, speaker-dependent, and speaker-specific splits with multiple regressors, reporting results via $R^2$ and Pearson's correlation $r$. The study finds that Hybrid BYOL-S provides superior predictive power in speaker-dependent settings, while generalization to unseen speakers remains limited; longer context windows (4–5 s) improve performance, and spectral features along with loudness emerge as key predictors. These findings support the potential of data-driven representations for speech-based physiological monitoring while highlighting the need for larger, more diverse datasets and explicit consideration of mental states in telemedicine applications.
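The evaluation setup summarized above can be illustrated with a minimal sketch. It assumes each speech window carries a speaker label and shows the difference between a speaker-independent split (whole speakers held out) and a speaker-dependent split (every speaker appears on both sides), together with the two reported metrics, $R^2$ and Pearson's $r$. The function names, the 20% test fraction, and the random shuffling are illustrative assumptions, not the authors' actual protocol.

```python
import math
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r(y_true, y_pred):
    """Pearson's correlation between targets and predictions."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def speaker_independent_split(speaker_ids, test_fraction=0.2, seed=0):
    """Hold out whole speakers: no speaker is in both train and test."""
    speakers = sorted(set(speaker_ids))
    random.Random(seed).shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train_idx = [i for i, s in enumerate(speaker_ids) if s not in test_speakers]
    test_idx = [i for i, s in enumerate(speaker_ids) if s in test_speakers]
    return train_idx, test_idx


def speaker_dependent_split(speaker_ids, test_fraction=0.2, seed=0):
    """Every speaker contributes windows to both train and test."""
    rng = random.Random(seed)
    by_speaker = {}
    for i, s in enumerate(speaker_ids):
        by_speaker.setdefault(s, []).append(i)
    train_idx, test_idx = [], []
    for s in sorted(by_speaker):
        idx = list(by_speaker[s])
        rng.shuffle(idx)  # assumption: windows are assigned at random
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx
```

In the speaker-independent case the regressor is scored only on voices it has never heard, which is the setting where the summary reports limited generalization. The speaker-specific protocol would correspond to applying a split of this kind to one speaker's windows at a time.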

Abstract

Accurately predicting heart activity and other biological signals is crucial for diagnosis and monitoring. Given that speech is an outcome of multiple physiological systems, a significant body of work studied the acoustic correlates of heart activity. Recently, self-supervised models have excelled in speech-related tasks compared to traditional acoustic methods. However, the robustness of data-driven representations in predicting heart activity remained unexplored. In this study, we demonstrate that self-supervised speech models outperform acoustic features in predicting heart activity parameters. We also emphasize the impact of individual variability on model generalizability. These findings underscore the value of data-driven representations in such tasks and the need for more speech-based physiological data to mitigate speaker-related challenges.

Paper Structure (9 sections, 6 figures)

Figures (6)

  • Figure 1: Training pipeline for predicting BPM and HRV values from knowledge-based and data-driven speech representations.
  • Figure 2: Performance of different speech features in both speaker conditions. The reported distributions show the evaluation across multiple regressors and window sizes, as well as the performance for predicting both targets (i.e., BPM and HRV).
  • Figure 3: Performance of different speech features with varying context window duration. The reported distributions show the evaluation using the speaker-dependent condition and a GBT regression model for predicting both targets (i.e., BPM and HRV).
  • Figure 4: Predictions from the GBT model using Hybrid BYOL-S features with a 5-second window size. Predictions are shown for speakers 52, 13, and 46, respectively.
  • Figure 5: Performance of Hybrid BYOL-S features for the speaker-specific protocol across all window sizes (3, 4, 5 seconds) of audio, using a GBT regressor model. The figure shows a random sample of 20 speakers for both targets, BPM and HRV.
  • ...and 1 more figures