Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics

Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li, Gang Yang, Rong Sheng, Yili Xia, Ming Chu

TL;DR

The superior performance of the LIPT framework and PSE architecture validates their readiness for integration into long-term telemonitoring systems, offering a scalable solution for remote heart failure management.

Abstract

Remote monitoring of heart failure (HF) via speech signals provides a non-invasive and cost-effective solution for long-term patient management. However, substantial inter-individual heterogeneity in vocal characteristics often limits the accuracy of traditional cross-sectional classification models. To address this, we propose a Longitudinal Intra-Patient Tracking (LIPT) scheme designed to capture the trajectory of relative symptomatic changes within individuals. Central to this framework is a Personalised Sequential Encoder (PSE), which transforms longitudinal speech recordings into context-aware latent representations. By incorporating historical data at each timestamp, the PSE facilitates a holistic assessment of the clinical trajectory rather than modelling discrete visits independently. Experimental results from a cohort of 225 patients demonstrate that the LIPT paradigm significantly outperforms the classic cross-sectional approaches, achieving a recognition accuracy of 99.7% for clinical status transitions. The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings. Furthermore, this work addresses the gap in existing literature by providing a comprehensive analysis of different speech task designs and acoustic features. Taken together, the superior performance of the LIPT framework and PSE architecture validates their readiness for integration into long-term telemonitoring systems, offering a scalable solution for remote heart failure management.

Paper Structure (29 sections, 12 equations, 7 figures, 6 tables, 1 algorithm)

Figures (7)

  • Figure 1: Conceptual framework of the Longitudinal Intra-Patient Tracking (LIPT) paradigm for personalised HF monitoring. Unlike standard cross-sectional classification, which suffers from low accuracy due to inter-subject heterogeneity, the LIPT pipeline directly models intra-individual temporal sequences. The workflow comprises three primary stages: (a) Feature Extraction, in which speech signals are characterised by spectral (time-frequency), rhythmic, and glottal quality attributes at both global and frame-wise levels; (b) HF-Voice Feature Screening, which employs statistical significance tests to identify features with pathophysiological associations to HF, while correlations between various feature categories and the disease state are evaluated; and (c) Longitudinal Intra-Patient Tracking (LIPT), where sequential voice samples from each individual are processed by a Personalised Sequential Encoder (PSE) to capture longitudinal dependencies. Finally, a longitudinal classifier generates individualised tracking results, which are validated against clinical gold standards to monitor disease trajectories over time.
  • Figure 2: Performance of different speech tasks with the PSE model. The long sentence task 'count' had the best macro-F1, followed by short sentences ('pg', 'mm', and 'mlh') and vowels ('a', 'i', and 'u'). The long sentence task also had the highest specificity, although its sensitivity was slightly lower than that of the short sentences 'mm' and 'mlh'. These two short sentences, which focus on voiced consonants, are also relatively balanced in sensitivity and specificity.
  • Figure 3: Confusion matrix and Receiver Operating Characteristic (ROC) curve of the PSE on the follow-up data set, using the best overall settings (RASTA features with HF-voice feature sets A and B; see Section \ref{statistical}) and the PSE models reported in Section \ref{pse_result}. The most prominent error category is the false positive, indicating high confusion regarding the stable (non-hospitalised) class, for which similar data is lacking in training. The relatively high area under the ROC curve (AUROC) suggests these errors stem from output misalignment rather than a lack of discriminative power.
  • Figure 4: Flowchart of the data collection process.
  • Figure 5: Architecture of the Personalised Sequential Encoder (PSE) and Longitudinal Classifier for speech-based monitoring. The framework consists of three operational stages: (1) Pretraining: The encoder acquires preliminary knowledge of speech patterns by reconstructing frame-level feature maps via a decoder. (2) Longitudinal Classifier Training: Frame-level features ($X_{0} \dots X_{T-1}$) are processed through a convolutional network and aggregated with the global features to form the temporally-interpolated latent representations ($Z_{0} \dots Z_{T-1}$). The system performs reference/target allocation to compare any pair of time points ($Z_{t_1}, Z_{t_2}$), using a linear layer to produce local comparison results ($\hat{y} \in [0,1]$) optimised against clinical true labels. (3) Evaluation: The model integrates medical records and time stamps to produce final personal tracking results.
  • ...and 2 more figures
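
The reference/target allocation described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pairwise_comparisons`, the choice to concatenate the reference and target latents, the restriction to ordered pairs with the earlier visit as reference, and the random (untrained) weights are all assumptions made here for illustration. The sketch only shows how one patient's latent representations could be paired and passed through a single linear layer with a sigmoid, giving a local comparison score between 0 and 1 for each pair.

```python
import itertools
import numpy as np


def sigmoid(x):
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def pairwise_comparisons(Z, w, b):
    """Score every (reference, target) pair of visits for one patient.

    Z : (T, d) array standing in for the latents Z_0 ... Z_{T-1}.
    w : (2 * d,) weights of a hypothetical linear comparison layer.
    b : scalar bias of that layer.
    Returns a list of (t1, t2, y_hat) tuples with t1 < t2.
    """
    results = []
    # Assumption: the earlier visit t1 is the reference, the later t2 the target.
    for t1, t2 in itertools.combinations(range(Z.shape[0]), 2):
        # Assumption: the pair is represented by concatenating both latents.
        pair = np.concatenate([Z[t1], Z[t2]])
        y_hat = float(sigmoid(pair @ w + b))
        results.append((t1, t2, y_hat))
    return results


# Toy data: 4 visits with 8-dimensional latents and untrained random weights.
rng = np.random.default_rng(0)
T, d = 4, 8
Z = rng.normal(size=(T, d))
w = rng.normal(size=2 * d)
b = 0.0
scores = pairwise_comparisons(Z, w, b)
```

With `T` visits this enumerates `T * (T - 1) / 2` pairs, so every later recording is compared against each earlier recording of the same patient rather than against other patients. In a trained system, the weights would be fitted against the clinical labels mentioned in the caption.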