Unified Acoustic Representations for Screening Neurological and Respiratory Pathologies from Voice

Ran Piao, Yuan Lu, Hareld Kemps, Tong Xia, Aaqib Saeed

TL;DR

This work tackles scalable, privacy-preserving voice-based screening across neurological, respiratory, and vocal disorders by introducing MARVEL, a multitask framework with dual-modal inputs (MFCCs and log-Mel spectrograms) and task-specific heads. The model leverages specialized encoders and a shared latent representation to enable cross-task knowledge transfer, evaluated on Bridge2AI-Voice v2.0 where it achieves strong AUROC scores, notably 0.97 for AD/MCI and 0.89 for airway stenosis. MARVEL outperforms single-task baselines and several self-supervised models on most tasks, with embeddings that correlate meaningfully with handcrafted acoustic biomarkers, enhancing interpretability. The results support deploying unified, privacy-conscious voice-based diagnostics in remote or resource-constrained settings and highlight directions for future improvements such as SSL pretraining and uncertainty quantification.

Abstract

Voice-based health assessment offers unprecedented opportunities for scalable, non-invasive disease screening, yet existing approaches typically focus on single conditions and fail to leverage the rich, multi-faceted information embedded in speech. We present MARVEL (Multi-task Acoustic Representations for Voice-based Health Analysis), a privacy-conscious multitask learning framework that simultaneously detects nine distinct neurological, respiratory, and voice disorders using only derived acoustic features, eliminating the need for raw audio transmission. Our dual-branch architecture employs specialized encoders with task-specific heads sharing a common acoustic backbone, enabling effective cross-condition knowledge transfer. Evaluated on the large-scale Bridge2AI-Voice v2.0 dataset, MARVEL achieves an overall AUROC of 0.78, with exceptional performance on neurological disorders (AUROC = 0.89), particularly for Alzheimer's disease/mild cognitive impairment (AUROC = 0.97). Our framework consistently outperforms single-modal baselines by 5-19% and surpasses state-of-the-art self-supervised models on 7 of 9 tasks, while correlation analysis reveals that the learned representations exhibit meaningful similarities with established acoustic features, indicating that the model's internal representations are consistent with clinically recognized acoustic patterns. By demonstrating that a single unified model can effectively screen for diverse conditions, this work establishes a foundation for deployable voice-based diagnostics in resource-constrained and remote healthcare settings.
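The dual-branch design described above can be illustrated with a toy forward pass. This is a minimal sketch under stated assumptions, not the authors' implementation: the layer sizes, the mean-pooling over frames, the fusion by concatenation, the single linear layers standing in for the "specialized encoders", and the placeholder task names are all hypothetical. It only shows the data flow: two feature streams (MFCCs and log-Mel frames) are encoded separately, fused into a shared representation, and passed to nine task-specific heads.

```python
import math
import random

# Placeholder names for the nine screening tasks (hypothetical labels).
TASKS = [f"task_{i}" for i in range(9)]


def linear(x, w, b):
    """Dense layer: one output per weight row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def init(n_out, n_in, rng):
    """Small random weights and zero biases for a dense layer."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def mean_pool(frames):
    """Average a (time x feature) matrix over time (an assumed pooling choice)."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


class DualBranchMultitask:
    def __init__(self, n_mfcc=13, n_mels=40, hidden=8, seed=0):
        rng = random.Random(seed)
        self.mfcc_enc = init(hidden, n_mfcc, rng)    # branch 1: MFCC encoder
        self.mel_enc = init(hidden, n_mels, rng)     # branch 2: log-Mel encoder
        self.shared = init(hidden, 2 * hidden, rng)  # shared layer after fusion
        self.heads = {t: init(1, hidden, rng) for t in TASKS}  # one head per task

    def forward(self, mfcc_frames, mel_frames):
        h_mfcc = relu(linear(mean_pool(mfcc_frames), *self.mfcc_enc))
        h_mel = relu(linear(mean_pool(mel_frames), *self.mel_enc))
        # Fuse by concatenation (assumed), then map to the shared representation.
        z = relu(linear(h_mfcc + h_mel, *self.shared))
        # Each head emits an independent probability for its own task.
        return {t: sigmoid(linear(z, *head)[0]) for t, head in self.heads.items()}


# Usage with random stand-in features: 50 frames per modality.
rng = random.Random(1)
mfcc = [[rng.gauss(0, 1) for _ in range(13)] for _ in range(50)]
mel = [[rng.gauss(0, 1) for _ in range(40)] for _ in range(50)]
probs = DualBranchMultitask().forward(mfcc, mel)
```

Because only derived features enter the model, raw audio never needs to leave the recording device in a setup like this; the shared layer is where cross-task transfer would take place during joint training.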

Paper Structure

This paper contains 28 sections, 7 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of our proposed multi-task dual-modal MARVEL framework.
  • Figure 2: Representative log-Mel spectrograms across nine voice-related disorders. Each panel illustrates a speech recording from one participant, showcasing condition-specific spectral patterns such as disrupted harmonics, reduced energy in higher frequencies, or atypical temporal structure. These visual differences underscore the diagnostic relevance of time–frequency representations in voice-based disease detection.
  • Figure 3: Task-level ROC curves of the proposed multitask voice-based disease classifier across nine subtype disorders.
  • Figure 4: t-SNE visualization of penultimate-layer embeddings from our model. Left: MCI vs. control; Right: Parkinson’s vs. control. Clear separation in the MCI task suggests strong latent discriminability, while Parkinson’s exhibits more subtle structure.
  • Figure 5: Top-5 handcrafted acoustic features most correlated with model embeddings across clinical tasks. For each task, bars show the absolute Pearson correlation between acoustic features and their most strongly associated dimension in the model's final-layer embeddings (computed on test set).
  • ...and 1 more figure
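The correlation analysis summarized in the Figure 5 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: for each handcrafted feature, take the absolute Pearson correlation with every embedding dimension, keep the strongest one, and rank features by that score. The function names, the data layout, and the toy values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def top_k_features(features, embeddings, k=5):
    """Rank handcrafted features by their strongest |r| with any embedding dimension.

    features:   dict mapping feature name -> list of per-sample values
    embeddings: list of per-sample embedding vectors (all the same length)
    """
    dims = list(zip(*embeddings))  # transpose to one sequence per dimension
    scores = {
        name: max(abs(pearson(values, dim)) for dim in dims)
        for name, values in features.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy example: 4 samples, 2 hypothetical features, 2-dimensional embeddings.
toy_features = {"feat_a": [1, 2, 3, 4], "feat_b": [1, -1, 1, -1]}
toy_embeddings = [[2, 0], [4, 1], [6, 0], [8, 0]]
ranking = top_k_features(toy_features, toy_embeddings, k=5)
```

In this toy input, `feat_a` is a linear function of the first embedding dimension, so it ranks first. In practice the inputs would be test-set embeddings and the corresponding acoustic measurements for one clinical task.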