Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation

Dena Mujtaba; Nihar R. Mahapatra; Megan Arney; J. Scott Yaruss; Caryn Herring; Jia Bin

TL;DR

This work presents an inclusive ASR design approach, leveraging large-scale self-supervised learning on standard speech followed by targeted fine-tuning and data augmentation on a smaller, curated dataset of disfluent speech.

Abstract

Automatic speech recognition (ASR) systems often falter while processing stuttering-related disfluencies -- such as involuntary blocks and word repetitions -- yielding inaccurate transcripts. A critical barrier to progress is the scarcity of large, annotated disfluent speech datasets. Therefore, we present an inclusive ASR design approach, leveraging large-scale self-supervised learning on standard speech followed by targeted fine-tuning and data augmentation on a smaller, curated dataset of disfluent speech. Our data augmentation technique enriches training datasets with various disfluencies, enhancing ASR processing of these speech patterns. Results show that fine-tuning wav2vec 2.0 with even a relatively small, labeled dataset, alongside data augmentation, can significantly reduce word error rates for disfluent speech. Our approach not only advances ASR inclusivity for people who stutter, but also paves the way for ASRs that can accommodate wider speech variations.
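For readers unfamiliar with the metric reported in the abstract, the following is a minimal sketch of word error rate (WER) using the standard word-level edit distance. It is not the authors' evaluation code; the function name `word_error_rate`, the lowercasing and whitespace tokenization, and the example sentences are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch: text normalization here is just lowercasing and
    whitespace splitting, which is an assumption, not the paper's setup.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a word repetition carried into the transcript
# counts as one insertion against a five-word reference.
print(word_error_rate("i want to go home", "i i want to go home"))  # 0.2
```

Under this definition, a repeated word that the recognizer carries into its output is scored as an insertion, which is one way disfluencies can inflate WER relative to the intended utterance.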

Paper Structure (13 sections, 1 equation, 3 figures, 2 tables)

Introduction
Related Work
Our Contributions
Methodology
wav2vec 2.0
FluencyBank
Non-Disfluent Speech Datasets
Data Augmentation for Disfluent Speech
Evaluation Metrics
Implementation
Results & Discussion
Conclusion
Acknowledgements

Figures (3)

Figure 1: Schematic of our method integrating data augmentation and fine-tuning of wav2vec 2.0 for stuttered speech, exemplified by augmentations from a FluencyBank speech sample ratner2018fluency. In wav2vec 2.0, $\mathcal{L}$ denotes the loss function, $\mathcal{C}$ represents context representations, $\mathcal{Q}$ indicates quantized representations, and $\mathcal{Z}$ corresponds to latent speech representations.
Figure 2: Distribution of WER for four of our models per disfluency type: wav2vec 2.0 (W2V2) tested on FluencyBank (FB), W2V2 tested on FluencyBank-N (FBN), W2V2 fine-tuned (FT) on FB and tested on FB, and W2V2 fine-tuned with $N=6000$ additional samples (i.e., $p=437$) tested on FB.
Figure 3: Plot illustrating WER per speaker on FluencyBank (FB) before and after model fine-tuning with data augmentation, as well as model accuracy on non-disfluent speech, FluencyBank-N (FBN). Speakers are represented by their age and gender, as identified in the FluencyBank dataset. Thus, "26f" represents a 26-year-old female.
