Structured Speaker-Deficiency Adaptation of Foundation Models for Dysarthric and Elderly Speech Recognition

Shujie Hu, Xurong Xie, Mengzhe Geng, Jiajun Deng, Zengrui Jin, Tianzi Wang, Mingyu Cui, Guinan Li, Zhaoqing Li, Helen Meng, Xunying Liu

TL;DR

The paper tackles the poor ASR performance on dysarthric and elderly speech by addressing data bias and speaker variability through structured speaker-deficiency adaptation. Supervised adaptive fine-tuning produces speaker- and deficiency-invariant foundation models, which serve as a more neutral starting point for test-time unsupervised adaptation using separate adapters for speaker identity and speech impairment. On UASpeech and DementiaBank Pitt, the approach delivers statistically significant WER reductions of up to 3.01% absolute and sets new state-of-the-art results (e.g., 19.45% on UASpeech and 17.45% on DementiaBank). The work points to a scalable path towards robust, personalized foundation-model ASR for impaired and aging speech, with potential impact on assistive technologies and early cognitive impairment screening.

Abstract

Data-intensive fine-tuning of speech foundation models (SFMs) on scarce and diverse dysarthric and elderly speech leads to data bias and poor generalization to unseen speakers. This paper proposes novel structured speaker-deficiency adaptation approaches for SSL pre-trained SFMs on such data. Speaker- and speech-deficiency-invariant SFMs are constructed in the supervised adaptive fine-tuning stage to reduce undue bias towards training-data speakers, and serve as a more neutral and robust starting point for test-time unsupervised adaptation. Speech variability attributed to speaker identity and to speech impairment severity, or aging-induced neurocognitive decline, is modelled using separate adapters that can be combined to model any seen or unseen speaker. Experiments on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest that structured speaker-deficiency adaptation of HuBERT and Wav2vec2-conformer models consistently outperforms baseline SFMs using either: a) no adapters; b) global adapters shared among all speakers; or c) single-attribute adapters modelling speaker or deficiency labels alone, by statistically significant WER reductions of up to 3.01% and 1.50% absolute (10.86% and 6.94% relative) on the two tasks, respectively. The lowest published WER of 19.45% (49.34% on very low intelligibility, 33.17% on unseen words) is obtained on the UASpeech test set of 16 dysarthric speakers.
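
The absolute and relative WER reductions quoted above are linked by the baseline WER: relative reduction = absolute reduction / baseline WER. The short sketch below inverts that relation as a back-of-envelope check. It assumes that each absolute/relative pair in the abstract comes from the same system comparison, which the text does not state explicitly; the function name `implied_baseline_wer` is hypothetical, and the values it returns are derived estimates, not figures reported by the paper.

```python
def implied_baseline_wer(absolute_reduction: float, relative_reduction_pct: float) -> float:
    """Baseline WER (in %) implied by an absolute and a relative WER reduction.

    relative = absolute / baseline  =>  baseline = absolute / relative
    Assumption: both reductions refer to the same pair of systems.
    """
    if relative_reduction_pct <= 0:
        raise ValueError("relative reduction must be positive")
    return absolute_reduction / (relative_reduction_pct / 100.0)


# Figures taken from the abstract (absolute %, relative %).
uaspeech_baseline = implied_baseline_wer(3.01, 10.86)
dementiabank_baseline = implied_baseline_wer(1.50, 6.94)

print(f"Implied UASpeech baseline WER:     {uaspeech_baseline:.2f}%")
print(f"Implied DementiaBank baseline WER: {dementiabank_baseline:.2f}%")
```

Under that assumption, the largest gains would correspond to baselines of roughly 28% and 22% WER on the two tasks.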

Paper Structure

This paper contains 11 sections, 6 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Examples of SSL pre-trained SFM (grey box, sub-figure (a)) adaptation using either speaker identity alone via c) speaker-dependent LHUC, or d) structured speaker-deficiency adapters. The adapter can be inserted either 1) after the CNN encoder (sub-figure (a)); or 2) in a specific transformer block (sub-figure (b)). "LN", "MHSA", "DP" and "FF" are layernorm, multi-head self-attention, dropout and feedforward modules.
  • Figure 2: Examples of i) adaptive training using structured speaker-deficiency adapters; ii) baseline fine-tuning; and iii) test time unsupervised adaptation using structured speaker-deficiency adapters. During adaptive training and test time adaptation, the parameters of the speech deficiency conditioned adapter and the speaker identity dependent adapter are estimated in turn in two stages.
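
The two captions above describe a deficiency-conditioned adapter and a speaker-dependent adapter that are estimated in turn. A minimal sketch of that idea follows, using the standard LHUC parameterisation (an element-wise scale of `2 * sigmoid(r)` on a hidden vector). Several things here are assumptions rather than the paper's method: applying the two adapters as successive element-wise scalings, the toy squared-error objective standing in for the ASR loss, and all names (`lhuc`, `apply_adapters`, `update`, the learning rate and step count).

```python
import math
from typing import List


def lhuc(raw: List[float]) -> List[float]:
    """LHUC scaling factors 2*sigmoid(r), each in (0, 2); r = 0 gives a scale of 1."""
    return [2.0 / (1.0 + math.exp(-r)) for r in raw]


def apply_adapters(hidden: List[float], deficiency_raw: List[float],
                   speaker_raw: List[float]) -> List[float]:
    """Scale a hidden vector by the deficiency adapter, then the speaker adapter.

    Assumption: the two adapters are combined by successive element-wise scaling.
    """
    d, s = lhuc(deficiency_raw), lhuc(speaker_raw)
    return [h * a * b for h, a, b in zip(hidden, d, s)]


def loss(hidden, target, deficiency_raw, speaker_raw) -> float:
    """Toy squared error, a stand-in for the real ASR training criterion."""
    out = apply_adapters(hidden, deficiency_raw, speaker_raw)
    return sum((o - t) ** 2 for o, t in zip(out, target))


def update(hidden, target, trainable, frozen, lr=0.1, steps=200):
    """Gradient descent on `trainable` only; `frozen` is never modified."""
    for _ in range(steps):
        a, b = lhuc(trainable), lhuc(frozen)
        for i, (h, t) in enumerate(zip(hidden, target)):
            err = h * a[i] * b[i] - t
            # d(2*sigmoid(r))/dr = a * (1 - a/2)
            grad = 2.0 * err * h * b[i] * a[i] * (1.0 - a[i] / 2.0)
            trainable[i] -= lr * grad
    return trainable


hidden = [1.0, -2.0, 0.5]            # hypothetical hidden activations
target = [1.5, -1.0, 0.5]            # hypothetical desired activations
deficiency_raw = [0.0, 0.0, 0.0]     # zero init: adapter starts as identity
speaker_raw = [0.0, 0.0, 0.0]

identity_out = apply_adapters(hidden, deficiency_raw, speaker_raw)
before = loss(hidden, target, deficiency_raw, speaker_raw)
update(hidden, target, deficiency_raw, speaker_raw)   # stage 1: deficiency adapter
update(hidden, target, speaker_raw, deficiency_raw)   # stage 2: speaker adapter
after = loss(hidden, target, deficiency_raw, speaker_raw)
print(f"loss before: {before:.4f}, after two-stage estimation: {after:.4f}")
```

Initialising both adapters at zero makes the adapted model identical to the unadapted one, so each stage can only move away from that neutral starting point as far as the data supports. In a realistic setting the deficiency adapter would be shared across speakers in the same impairment group while the speaker adapter is specific to one speaker; this toy example uses a single speaker and group for brevity.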