Predicting Heart Activity from Speech using Data-driven and Knowledge-based features
Gasser Elbanna, Zohreh Mostaani, Mathew Magimai.-Doss
TL;DR
The paper addresses predicting heart activity parameters, specifically beats per minute (BPM) and heart rate variability (HRV), from speech and investigates how data-driven, self-supervised speech representations compare to traditional acoustic features. Using the Ulm-TSST dataset, it evaluates Hybrid BYOL-S against openSMILE features (eGeMAPS, ComParE) across speaker-independent, speaker-dependent, and speaker-specific splits with multiple regressors, reporting results via $R^2$ and Pearson's correlation $r$. The study finds that Hybrid BYOL-S provides superior predictive power in speaker-dependent settings, while generalization to unseen speakers remains limited; longer context windows (4–5 s) improve performance, and spectral features along with loudness emerge as key predictors. These findings support the potential of data-driven representations for speech-based physiological monitoring while highlighting the need for larger, more diverse datasets and explicit consideration of mental states in telemedicine applications.
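For reference, the two reported metrics, $R^2$ and Pearson's $r$, can be computed as in the minimal pure-Python sketch below. This is not the authors' evaluation code: the function names `r2_score` and `pearson_r` and the toy BPM values are illustrative assumptions.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson's linear correlation coefficient between two sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ground-truth and predicted BPM values (not from the paper).
bpm_true = [62.0, 70.0, 75.0, 81.0, 90.0]
bpm_pred = [65.0, 68.0, 77.0, 80.0, 86.0]

print(f"R^2 = {r2_score(bpm_true, bpm_pred):.3f}")
print(f"r   = {pearson_r(bpm_true, bpm_pred):.3f}")
```

The two metrics answer different questions: $r$ measures only linear association, so predictions with a constant offset can still reach $r = 1$, whereas $R^2$ penalizes any deviation from the true values and can be negative. Reporting both helps separate how well predictions track the target from how accurate they are.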
Abstract
Accurately predicting heart activity and other biological signals is crucial for diagnosis and monitoring. Given that speech is an outcome of multiple physiological systems, a significant body of work has studied the acoustic correlates of heart activity. Recently, self-supervised models have excelled in speech-related tasks compared to traditional acoustic methods. However, the robustness of data-driven representations in predicting heart activity has remained unexplored. In this study, we demonstrate that self-supervised speech models outperform acoustic features in predicting heart activity parameters. We also emphasize the impact of individual variability on model generalizability. These findings underscore the value of data-driven representations in such tasks and the need for more speech-based physiological data to mitigate speaker-related challenges.
