Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition

Shujie Hu, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Yi Wang, Mingyu Cui, Tianzi Wang, Helen Meng, Xunying Liu

TL;DR

This work tackles dysarthric and elderly ASR under data scarcity by systematically integrating domain-adapted SSL foundation models (e.g., Wav2vec2.0, HuBERT, WavLM, Data2vec) into hybrid TDNN and Conformer systems. It introduces input feature fusion, frame-level joint decoding, and cross-system multi-pass rescoring, augmented by cross-domain acoustic-to-articulatory inversion to generate UTI-based articulatory features for multimodal ASR. The approach yields significant WER/CER reductions across English UASpeech, TORGO, DementiaBank Pitt, and Cantonese JCCOCC MoCA datasets, with best results of $\text{WER}=20.56\%$ on UASpeech and $\text{CER}=18.07\%$ on DementiaBank Pitt; AD detection accuracy also improves when using SSL-enhanced transcripts. The findings highlight the practical impact of cross-domain SSL representations for robust, scalable dysarthric and elderly speech recognition and point to future work on rapid personalization and model compression for deployment.

Abstract

Self-supervised learning (SSL) based speech foundation models have been applied to a wide range of ASR tasks. However, their application to dysarthric and elderly speech via data-intensive parameter fine-tuning is confronted by in-domain data scarcity and mismatch. To this end, this paper explores a series of approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition. These include: a) input feature fusion between standard acoustic frontends and domain fine-tuned SSL speech representations; b) frame-level joint decoding between TDNN systems separately trained using standard acoustic features alone and those with additional domain fine-tuned SSL features; and c) multi-pass decoding involving the TDNN/Conformer system outputs to be rescored using domain fine-tuned pre-trained ASR models. In addition, fine-tuned SSL speech features are used in acoustic-to-articulatory (A2A) inversion to construct multi-modal ASR systems. Experiments are conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora; and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets. The TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models and their features consistently outperform the standalone fine-tuned SSL pre-trained models. These systems produced statistically significant WER or CER reductions of 6.53%, 1.90%, 2.04% and 7.97% absolute (24.10%, 23.84%, 10.14% and 31.39% relative) on the four tasks respectively. Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
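
The absolute and relative reductions quoted above are linked through the baseline error rate, since relative reduction = absolute reduction / baseline. As a rough consistency check, the sketch below backs out the baseline and resulting error rate implied by each pair of figures. The function name `implied_rates` is illustrative, the task ordering is assumed to follow the order in which the abstract lists the corpora, and the derived values are approximations computed from rounded numbers rather than results reported by the paper.

```python
# Back out the baseline and final error rates implied by an
# (absolute reduction, relative reduction) pair, both given in percent.
# relative = absolute / baseline  =>  baseline = absolute / relative


def implied_rates(absolute_pct, relative_pct):
    """Return (implied baseline, implied final) error rates in percent."""
    baseline = absolute_pct / (relative_pct / 100.0)
    return baseline, baseline - absolute_pct


# Figures copied from the abstract; the task order is an assumption.
reductions = {
    "UASpeech": (6.53, 24.10),
    "TORGO": (1.90, 23.84),
    "DementiaBank Pitt": (2.04, 10.14),
    "JCCOCC MoCA": (7.97, 31.39),
}

for task, (abs_red, rel_red) in reductions.items():
    baseline, final = implied_rates(abs_red, rel_red)
    print(f"{task}: implied baseline ~{baseline:.2f}%, after ~{final:.2f}%")
```

For UASpeech this gives an implied final error rate of about 20.6%, which is in line with the 20.56% WER quoted in the TL;DR.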

Paper Structure (30 sections, 9 equations, 2 figures, 10 tables)

This paper contains 30 sections, 9 equations, 2 figures, 10 tables.

Figures (2)

  • Figure 1: An example of domain fine-tuned HuBERT feature based cross-domain acoustic-to-articulatory (A2A) inversion model architecture including: (1) the HuBERT encoder three-stage fine-tuned on out-of-domain 960-hour LibriSpeech, dysarthric UASpeech, and then the combined TaL+UASpeech audio data; (2) A2A model training using the HuBERT features extracted from TaL data and the parallel UTI-based articulatory features serving as the targets; (3) A2A inversion to generate UTI-based articulatory features using HuBERT features extracted from UASpeech.
  • Figure 2: Dysarthric/elderly speech fine-tuned Wav2vec2.0/HuBERT models containing a "Bottleneck Module" (located at one of three different positions via connections (e), (f) or (g)) used to extract domain-adapted speech features. These models and their features are integrated into TDNN/Conformer ASR systems trained on in-domain dysarthric/elderly speech only using: 1) input feature fusion with standard acoustic frontends via connections (a) and (b); 2) TDNN system frame-level joint decoding in the green box; and 3) TDNN/Conformer systems' N-best outputs multi-pass rescoring using domain fine-tuned SSL pre-trained models in the brown box, as presented in Sec. \ref{sec:sys_integrate}. Connections (c) and (d) produce acoustic-articulatory speech recognition (AASR) systems using additional articulatory features predicted from domain-adapted SSL speech features via A2A inversion of Sec. \ref{sec:a2a_inversion}. "Trans" denotes transformer.
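
The three integration routes named in the Figure 2 caption can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration: the function names, the plain per-frame concatenation, the linear interpolation of per-frame log scores, and the additive N-best score combination are generic formulations and are not confirmed by the source as the paper's exact method. The sketch also assumes the two feature streams are already aligned to the same frame rate.

```python
def fuse_features(acoustic_frames, ssl_frames):
    """Input feature fusion: concatenate per-frame vectors.

    Assumes both streams have the same number of frames.
    """
    if len(acoustic_frames) != len(ssl_frames):
        raise ValueError("frame counts must match before fusion")
    return [a + s for a, s in zip(acoustic_frames, ssl_frames)]


def joint_frame_scores(scores_a, scores_b, weight_b=0.5):
    """Frame-level joint decoding: interpolate two systems' per-frame log scores.

    weight_b is a hypothetical interpolation weight for the second system.
    """
    return [
        [(1.0 - weight_b) * x + weight_b * y for x, y in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(scores_a, scores_b)
    ]


def rescore_nbest(nbest, second_pass_score, scale=1.0):
    """Multi-pass rescoring: re-rank (text, first_pass_score) hypotheses.

    second_pass_score stands in for a fine-tuned SSL model's score function.
    """
    return max(nbest, key=lambda hyp: hyp[1] + scale * second_pass_score(hyp[0]))[0]


# Toy usage with made-up numbers.
fused = fuse_features([[0.1, 0.2], [0.3, 0.4]], [[1.0], [2.0]])
combined = joint_frame_scores([[-1.0, -2.0]], [[-3.0, -1.0]], weight_b=0.5)
toy_second_pass = {"a cat": -4.0, "the cat": -1.0}
best = rescore_nbest([("a cat", -5.0), ("the cat", -5.5)], toy_second_pass.get)
print(len(fused[0]), combined, best)
```

In this toy run the second pass overturns the first-pass ranking, which is the situation in which rescoring with a stronger model would be expected to help.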