Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation
Huimeng Wang, Zengrui Jin, Mengzhe Geng, Shujie Hu, Guinan Li, Tianzi Wang, Haoning Xu, Xunying Liu
TL;DR
This work tackles the limited-data problem in dysarthric speech recognition by evaluating how data augmentation can improve the fine-tuning of SSL pre-trained ASR models (Wav2vec2.0 and HuBERT). It compares conventional speed perturbation and speaker-dependent perturbation with adversarial approaches, namely DCGAN-based and Spectral basis GAN-based augmentation, across multiple model backbones and a multitask fine-tuning regime that includes impairment severity cues. Experiments on UASpeech show that GAN-based augmentation consistently outperforms both the non-augmented and the speed-perturbed baselines, with notable gains from the Spectral basis GAN when the training data is expanded. System combination with rescoring achieves the lowest published WER of 16.53% on UASpeech. This demonstrates a practical path to more robust dysarthric ASR with limited labeled data, potentially improving assistive communication systems for dysarthric speakers.
Abstract
Automatic recognition of dysarthric speech remains a highly challenging task to date. Neuro-motor conditions and co-occurring physical disabilities make large-scale data collection for ASR system development difficult. Adapting SSL pre-trained ASR models to limited dysarthric speech via data-intensive parameter fine-tuning leads to poor generalization. To this end, this paper presents an extensive comparative study of various data augmentation approaches to improve the robustness of pre-trained ASR model fine-tuning to dysarthric speech. These include: a) conventional speaker-independent perturbation of impaired speech; b) speaker-dependent speed perturbation, or GAN-based adversarial perturbation, of normal control speech based on its time alignment against parallel dysarthric speech; c) novel Spectral basis GAN-based adversarial data augmentation operating on non-parallel data. Experiments conducted on the UASpeech corpus suggest that GAN-based data augmentation consistently outperforms fine-tuned Wav2vec2.0 and HuBERT models that use either no data augmentation or speed perturbation across different data expansion operating points, with statistically significant word error rate (WER) reductions of up to 2.01% and 0.96% absolute (9.03% and 4.63% relative) respectively on the UASpeech test set of 16 dysarthric speakers. After cross-system output rescoring, the best system produced the lowest published WER of 16.53% (46.47% on very low intelligibility) on UASpeech.
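The absolute and relative WER reductions quoted above are linked by the usual definition, relative reduction = absolute reduction / baseline WER. As a minimal sketch of that arithmetic, the snippet below back-computes the baseline WER implied by each reported pair. The function name `implied_baseline_wer` is hypothetical, and the resulting baselines are approximations derived from rounded figures, not values stated in the abstract.

```python
def implied_baseline_wer(absolute_reduction: float, relative_reduction: float) -> float:
    """Return the baseline WER (in %) implied by a reported reduction pair.

    Both arguments are percentages. Uses the standard relation
    relative = absolute / baseline, so baseline = absolute / relative.
    """
    if relative_reduction <= 0:
        raise ValueError("relative_reduction must be positive")
    return absolute_reduction / (relative_reduction / 100.0)


# Reduction pairs quoted in the abstract: (absolute %, relative %).
reported = [(2.01, 9.03), (0.96, 4.63)]

for absolute, relative in reported:
    baseline = implied_baseline_wer(absolute, relative)
    improved = baseline - absolute  # WER after augmentation, under the same assumption
    print(
        f"abs {absolute:.2f} / rel {relative:.2f}: "
        f"implied baseline {baseline:.2f}%, implied improved {improved:.2f}%"
    )
```

Because the published figures are rounded to two decimals, the implied baselines are only accurate to within a few hundredths of a percentage point.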
