Speech foundation models in healthcare: Effect of layer selection on pathological speech feature prediction

Daniela A. Wiepert, Rene L. Utianski, Joseph R. Duffy, John L. Stricker, Leland R. Barnard, David T. Jones, Hugo Botha

TL;DR

This study investigates how layer selection in a speech foundation model (wav2vec 2.0) affects the prediction of fine-grained pathological speech features from alternating motion rate (AMR) and sequential motion rate (SMR) tasks. Choosing the optimal layer yields large gains over the final layer, but the best layer varies by feature and does not always generalize to unseen data; a learned weighted sum across layers is a competitive alternative that generalizes well. On in-distribution data, the best layer improves balanced accuracy per feature by roughly 16% over the worst layer and roughly 14% over the final layer, while the weighted sum falls only about 1.2% below the average best layer. On out-of-distribution data, the weighted sum stays within about 1.5% of the average best layer. The findings highlight the importance of layer-wise representation choice in clinical speech applications and point to weighted-layer fusion as a robust strategy under data shift, though limited data diversity and model generalizability should be weighed before deployment in practice.

Abstract

Accurately extracting clinical information from speech is critical to the diagnosis and treatment of many neurological conditions. As such, there is interest in leveraging AI for automatic, objective assessments of clinical speech to facilitate diagnosis and treatment of speech disorders. We explore transfer learning using foundation models, focusing on the impact of layer selection for the downstream task of predicting pathological speech features. We find that selecting an optimal layer can greatly improve performance (~15.8% increase in balanced accuracy per feature compared with the worst layer, ~13.6% increase compared with the final layer), though the best layer varies by predicted feature and does not always generalize well to unseen data. A learned weighted sum offers performance comparable to the average best layer in-distribution (only ~1.2% lower) and generalizes well to out-of-distribution data (only ~1.5% lower than the average best layer).
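
The learned weighted sum mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the softmax normalization of the layer weights, the function names, and the toy shapes (4 layers, 5 frames, 3 dimensions) are all assumptions made for illustration. In practice the per-layer logits would be trained jointly with the classifier; here they are fixed so the arithmetic is easy to follow.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    logits = np.asarray(logits, dtype=float)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def weighted_layer_sum(hidden_states, layer_logits):
    """Combine per-layer representations into one representation.

    hidden_states: array of shape (num_layers, time, dim), one slice per layer.
    layer_logits:  array of shape (num_layers,), hypothetical learnable scalars.
    Returns an array of shape (time, dim).
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    # Assumption: weights are softmax-normalized so they are positive and sum to 1.
    weights = softmax(layer_logits)
    # Contract the layer axis: sum_l weights[l] * hidden_states[l].
    return np.tensordot(weights, hidden_states, axes=(0, 0))


# Toy stand-in for the hidden states of a speech model (not real wav2vec 2.0 output).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 5, 3))

# With all logits equal, the weighted sum reduces to a plain average over layers.
pooled = weighted_layer_sum(hidden, np.zeros(4))
print(pooled.shape)  # (5, 3)
```

Selecting a single layer is the special case in which one weight is 1 and the rest are 0, which is why the weighted sum can approach the best single layer without that layer having to be chosen in advance.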

Paper Structure (11 sections, 5 figures, 2 tables)


Figures (5)

  • Figure 1: Overview of model architecture. To predict the presence of pathological speech features, we either (a) extracted a single layer from wav2vec 2.0 or (b) combined all layers in a learnable weighted sum, then fed the resulting representations to (c) a single classifier or (d) multiple classifiers, with an optional shared dense layer prior to the classifier.
  • Figure 2: Comparing balanced accuracy by learning rate and layer across the predicted pathological speech features, with 95% confidence intervals (AMR).
  • Figure 3: Comparing balanced accuracy across layers for each predicted pathological speech feature (AMR).
  • Figure 4: Comparing balanced accuracy across best and worst layers and weighted sum for each predicted pathological speech feature, with 95% confidence intervals (AMR).
  • Figure 5: Comparing balanced accuracy across layers for each predicted pathological speech feature (SMR).
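
Figures 2 to 5 all report balanced accuracy. The sketch below shows how that metric is commonly computed, assuming the standard definition (the mean of per-class recall); the source does not spell out its exact computation, and the labels used here are made up for illustration.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, assuming the standard definition."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # Indices of the samples whose true label is class c.
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical labels: a feature present in 2 of 10 recordings.
y_true = [0] * 8 + [1] * 2
always_absent = [0] * 10

# Plain accuracy would be 0.8 here, but balanced accuracy exposes the missed class.
print(balanced_accuracy(y_true, always_absent))  # 0.5
```

Because each class contributes equally regardless of how often it occurs, the metric suits features that are present in only a small share of recordings.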