Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

Huimeng Wang; Zengrui Jin; Mengzhe Geng; Shujie Hu; Guinan Li; Tianzi Wang; Haoning Xu; Xunying Liu

TL;DR

This work tackles the limited-data problem in dysarthric speech recognition by evaluating how data augmentation can improve the fine-tuning of SSL pre-trained ASR models (Wav2vec2.0 and HuBERT). It compares conventional speed perturbation and speaker-dependent perturbation with adversarial approaches, namely DCGAN-based and Spectral basis GAN-based augmentation, across both model backbones and under a multitask fine-tuning regime that incorporates speech impairment severity cues. Experiments on UASpeech show that GAN-based augmentation consistently outperforms both fine-tuning without augmentation and fine-tuning with speed perturbation, with notable gains from the Spectral basis GAN as the amount of augmented data is expanded. System combination with rescoring achieves the lowest published WER on UASpeech, 16.53%. This demonstrates a practical path to more robust dysarthric ASR with limited labeled data, potentially improving assistive communication systems for dysarthric speakers.

Abstract

Automatic recognition of dysarthric speech remains a highly challenging task to date. Neuro-motor conditions and co-occurring physical disabilities create difficulty in large-scale data collection for ASR system development. Adapting SSL pre-trained ASR models to limited dysarthric speech via data-intensive parameter fine-tuning leads to poor generalization. To this end, this paper presents an extensive comparative study of various data augmentation approaches to improve the robustness of pre-trained ASR model fine-tuning to dysarthric speech. These include: a) conventional speaker-independent perturbation of impaired speech; b) speaker-dependent speed perturbation, or GAN-based adversarial perturbation of normal, control speech based on their time alignment against parallel dysarthric speech; c) novel Spectral basis GAN-based adversarial data augmentation operating on non-parallel data. Experiments conducted on the UASpeech corpus suggest that GAN-based data augmentation consistently outperforms fine-tuned Wav2vec2.0 and HuBERT models using no data augmentation and speed perturbation across different data expansion operating points, by statistically significant word error rate (WER) reductions of up to 2.01% and 0.96% absolute (9.03% and 4.63% relative) respectively on the UASpeech test set of 16 dysarthric speakers. After cross-system output rescoring, the best system produced the lowest published WER of 16.53% (46.47% on very low intelligibility) on UASpeech.
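
The abstract reports gains as both absolute and relative WER reductions. As a minimal sketch of how these quantities relate, the Python below computes WER from word-level edit distance and then derives both kinds of reduction. The function names (`word_errors`, `wer`, `reductions`) and the toy numbers are illustrative assumptions, not the paper's scoring pipeline or results.

```python
def word_errors(ref: str, hyp: str) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # distance from empty reference prefix
    for i, rw in enumerate(r, 1):
        cur = [i]  # distance from reference prefix to empty hypothesis
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (rw != hw))) # substitution or match
        prev = cur
    return prev[-1]


def wer(refs, hyps) -> float:
    """Corpus-level WER in percent: total errors over total reference words."""
    errors = sum(word_errors(r, h) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return 100.0 * errors / words


def reductions(baseline_wer: float, new_wer: float):
    """Return (absolute, relative) WER reduction, both in percent."""
    absolute = baseline_wer - new_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# Toy values for illustration only (not taken from the paper).
print(wer(["a b c d"], ["a x c"]))   # one substitution, one deletion over 4 words
print(reductions(20.0, 18.0))        # absolute and relative reduction
```

An absolute reduction is a difference in percentage points, while a relative reduction expresses that difference as a fraction of the baseline WER, which is why the two figures quoted for each model differ.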

Paper Structure (17 sections, 4 equations, 1 figure, 2 tables)

Introduction
Pre-trained ASR systems
Pre-trained Wav2vec2.0 Model
Pre-trained HuBERT Model
Speech Impairment Severity Based Multitask Fine-tuning
Conventional Data Augmentation
Speed Perturbation Based Data Augmentation
Speaker Dependent Perturbation Based Augmentation
Adversarial Data Augmentation
DCGAN based Data Augmentation
Spectral basis GAN based Data Augmentation
Experiments
Task Description
Experiment Setup
Result Analysis
...and 2 more sections

Figures (1)

Figure 1: Illustration of (a) DCGAN model training on parallel control and dysarthric utterances with modified duration and time alignment; (b) DCGAN based speaker-dependent (SD) dysarthric speech generation using SD speed perturbed normal speech; (c) Spectral basis GAN model training on SVD decomposed non-parallel control and dysarthric speech; and (d) Spectral basis GAN based SD dysarthric speech generation by re-composition of perturbed control speech derived spectral basis vectors with their temporal bases.
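
Panels (c) and (d) of the caption describe decomposing spectra with SVD and re-composing perturbed spectral basis vectors with their temporal bases. The NumPy sketch below illustrates only that decompose, perturb, re-compose pattern on a toy matrix. It rests on several assumptions: the input is a random stand-in for a (frequency x time) spectrogram, and `perturb_spectral_basis` is a hypothetical placeholder that adds small noise where the paper would apply a trained GAN generator. It is not the authors' implementation.

```python
import numpy as np


def decompose(spec):
    """Thin SVD of a (freq x time) matrix: spec = U @ diag(s) @ Vt.

    Columns of U act as spectral bases, rows of Vt as temporal bases.
    """
    U, s, Vt = np.linalg.svd(spec, full_matrices=False)
    return U, s, Vt


def recompose(U, s, Vt):
    """Rebuild the (freq x time) matrix from its SVD factors."""
    return (U * s) @ Vt  # broadcasting scales each column of U by its singular value


def perturb_spectral_basis(U, scale=0.05, seed=0):
    """Hypothetical stand-in for a trained generator: add small Gaussian noise."""
    rng = np.random.default_rng(seed)
    return U + scale * rng.standard_normal(U.shape)


# Toy non-negative matrix standing in for a control-speech spectrogram.
rng = np.random.default_rng(0)
spec = np.abs(rng.standard_normal((40, 120)))

U, s, Vt = decompose(spec)
reconstructed = recompose(U, s, Vt)                     # unperturbed round trip
augmented = recompose(perturb_spectral_basis(U), s, Vt) # perturbed spectral basis, original temporal basis

print(np.allclose(reconstructed, spec))
print(augmented.shape)
```

Only the spectral factor is modified here, and the temporal factor is reused unchanged, which mirrors the re-composition step named in panel (d).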
