WavRx: a Disease-Agnostic, Generalizable, and Privacy-Preserving Speech Health Diagnostic Model

Yi Zhu, Tiago Falk

TL;DR

WavRx tackles the challenge of disease-agnostic, privacy-preserving speech health diagnostics by integrating a universal temporal representation from WavLM with a long-range modulation dynamics block. The model achieves state-of-the-art or near-SOTA performance on six pathological datasets across four diseases, and demonstrates notable zero-shot generalization to unseen datasets. Importantly, the modulation dynamics component significantly reduces speaker-identity leakage in health embeddings while preserving diagnostic accuracy, as evidenced by privacy and sparsity analyses. Low-frequency modulations (<2 Hz) primarily drive discriminative power, providing physiological insight into the model's generalization and interpretability. Overall, WavRx establishes a promising, privacy-aware framework for broad, cross-dataset health assessment from speech, with potential as a new benchmark for health-diagnostic tasks.

Abstract

Speech is known to carry health-related attributes and has emerged as a novel avenue for remote and long-term health monitoring. However, existing models are usually tailored to a specific type of disease and have been shown to generalize poorly across datasets. Furthermore, concerns have recently been raised about the leakage of speaker identity from health embeddings. To mitigate these limitations, we propose WavRx, a speech health diagnostics model that captures respiration- and articulation-related dynamics from a universal speech representation. Our in-domain and cross-domain experiments on six pathological speech datasets demonstrate that WavRx is a new state-of-the-art health diagnostic model. Furthermore, we show that the amount of speaker identity contained in the WavRx health embeddings is significantly reduced without extra guidance during training. An in-depth analysis of the model provides a physiological interpretation of its improved generalizability and privacy-preserving ability.

Paper Structure

This paper contains 31 sections, 3 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Architecture of the proposed WavRx model.
  • Figure 2: The modulation dynamics block takes the weighted sum of hidden states from the WavLM transformer backbone and applies STFT to each feature channel.
  • Figure 3: Average F1 scores achieved with different model design choices. The starred ones are the adopted design choices.
  • Figure 4: Projected health embeddings learned from temporal representations (left) and dynamic representations (right).
  • Figure 5: F-ratio plots computed between the modulation dynamics of positive and negative samples for each of the six datasets. The X-axis shows the modulation frequency (in Hz) and the Y-axis shows the feature dimension, which contains 768 features in total. Brighter areas mark the modulation frequencies at which the two classes are better separated.
  • ...and 1 more figures
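
The Figure 2 caption describes a weighted sum of WavLM hidden states followed by an STFT over each feature channel, and Figure 5 compares the resulting modulation dynamics of the two classes with an F-ratio. The NumPy sketch below illustrates both steps on synthetic data. It is not the authors' implementation: the function names, the softmax over layer weights, the Hann window, the window and hop lengths, the 50 frames-per-second rate, and the Fisher-style F-ratio formula are all assumptions made for illustration, and the toy array stands in for real WavLM hidden states.

```python
import numpy as np


def modulation_dynamics(hidden_states, layer_logits, frame_rate=50.0,
                        win_len=100, hop=50):
    """Sketch of a modulation dynamics block (all parameters are assumptions).

    hidden_states: array of shape (layers, frames, channels).
    layer_logits:  array of shape (layers,), one learnable score per layer.
    Returns (power, freqs); power has shape (windows, mod_freqs, channels).
    """
    # Softmax turns the per-layer scores into weights that sum to one.
    weights = np.exp(layer_logits - np.max(layer_logits))
    weights /= weights.sum()
    # Weighted sum over the layer axis -> (frames, channels).
    fused = np.tensordot(weights, hidden_states, axes=(0, 0))

    # STFT along time, applied independently to every feature channel.
    window = np.hanning(win_len)
    starts = range(0, fused.shape[0] - win_len + 1, hop)
    segments = np.stack([fused[s:s + win_len] * window[:, None] for s in starts])
    spectrum = np.fft.rfft(segments, axis=1)
    # Modulation frequencies in Hz, given the assumed feature frame rate.
    freqs = np.fft.rfftfreq(win_len, d=1.0 / frame_rate)
    return np.abs(spectrum) ** 2, freqs


def f_ratio(pos, neg, eps=1e-8):
    """Fisher-style ratio per feature: squared mean gap over summed variances.

    This is one common definition, assumed here; pos and neg have shape
    (samples, ...) and the ratio is computed over the sample axis.
    """
    gap = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    return gap / (pos.var(axis=0) + neg.var(axis=0) + eps)


# Toy input: 13 layers, 10 s at 50 frames/s, 8 channels (the source uses 768).
rng = np.random.default_rng(0)
layers, frames, channels = 13, 500, 8
t = np.arange(frames) / 50.0
hidden = 0.1 * rng.standard_normal((layers, frames, channels))
hidden[:, :, 0] += np.sin(2 * np.pi * 1.0 * t)  # 1 Hz modulation in channel 0

power, freqs = modulation_dynamics(hidden, np.zeros(layers))
avg_power = power.mean(axis=0)                   # (mod_freqs, channels)
peak_hz = freqs[np.argmax(avg_power[:, 0])]      # dominant modulation, channel 0
print(power.shape, peak_hz)

# F-ratio on a tiny hand-made example with two features.
ratio = f_ratio(np.array([[1.0, 0.0], [3.0, 0.0]]), np.zeros((2, 2)))
print(ratio)
```

In this sketch the window length sets the modulation-frequency resolution (frame rate divided by window length, 0.5 Hz here), so resolving the sub-2 Hz range highlighted in the TL;DR requires windows that span several seconds of speech.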