
AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection

Rong Gong, Hongfei Xue, Lezhi Wang, Xin Xu, Qisheng Li, Lei Xie, Hui Bu, Shaomei Wu, Jiaming Zhou, Yong Qin, Binbin Zhang, Jun Du, Jia Bin, Ming Li

TL;DR

The paper tackles the lack of Mandarin stuttered speech resources for ASR and stuttering event detection (SED) by introducing AS-70, a large open dataset recorded from 70 native Mandarin-speaking adults who stutter (AWS). It covers conversational and voice-command-reading tasks, with verbatim, character-level transcription and privacy safeguards. It provides descriptive analyses, including the stuttering rate (SR) and the distribution of stuttering event types, and benchmarks ASR and SED with baseline systems (Conformer, HuBERT, and Whisper for ASR; StutterNet, ConvLSTM, Conformer, and wav2vec 2.0 for SED), showing that fine-tuning on AS-70 yields notable performance gains. The work also reports dataset-specific findings, such as how stuttering types are distributed across the two tasks, and demonstrates that richly annotated, non-Western resources can make speech technologies more inclusive. By releasing AS-70 openly, the study aims to accelerate research on Mandarin stuttering detection and on robust ASR for people who stutter, enabling more accessible human–machine interfaces.
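The TL;DR mentions a stuttering rate (SR) and event-type distributions without defining them here. The following Python sketch shows one way such statistics could be computed from verbatim annotations. It assumes, as a hypothetical definition, that SR is the number of annotated stuttering events per transcribed character; the `Utterance` class and the labels in `EVENT_TYPES` are likewise illustrative and are not the AS-70 annotation format.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical event labels for illustration; the actual AS-70 label set may differ.
EVENT_TYPES = ("prolongation", "block", "sound_repetition", "word_repetition", "interjection")


@dataclass
class Utterance:
    text: str                                        # verbatim, character-level transcript
    events: list[str] = field(default_factory=list)  # stuttering event labels in this utterance


def stuttering_rate(utts: list[Utterance]) -> float:
    """Assumed SR definition: annotated stuttering events per transcribed character."""
    n_chars = sum(len(u.text) for u in utts)
    n_events = sum(len(u.events) for u in utts)
    return n_events / n_chars if n_chars else 0.0


def event_distribution(utts: list[Utterance]) -> dict[str, float]:
    """Share of each event type among all annotated events (0.0 when there are none)."""
    counts = Counter(e for u in utts for e in u.events)
    total = sum(counts.values())
    return {t: (counts[t] / total if total else 0.0) for t in EVENT_TYPES}


if __name__ == "__main__":
    utts = [
        Utterance("我我想打电话", ["word_repetition"]),  # repeated 我
        Utterance("打开空调"),                           # fluent command
    ]
    print(stuttering_rate(utts))     # 0.1 (1 event / 10 characters)
    print(event_distribution(utts))
```

Counting per character is only one option; an utterance-level rate (the share of utterances containing at least one event) could be computed the same way by replacing the numerator and denominator.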

Abstract

The rapid advancements in speech technologies over the past two decades have led to human-level performance in tasks such as automatic speech recognition (ASR) for fluent speech. However, these models become less effective when applied to atypical speech, such as stuttering. This paper introduces AS-70, the first publicly available Mandarin stuttered speech dataset and the largest dataset in its category. Covering both conversational and voice command reading speech, AS-70 includes verbatim manual transcriptions, making it suitable for a variety of speech-related tasks. Furthermore, baseline systems are established and experimental results are presented for the ASR and stuttering event detection (SED) tasks. Fine-tuning state-of-the-art ASR models, e.g., Whisper and HuBERT, on this dataset yields significant improvements, enhancing their inclusivity toward stuttered speech.
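Because AS-70 is transcribed verbatim at the character level, Mandarin ASR output on it is naturally scored with a character-level metric. The sketch below assumes the standard character error rate (CER), i.e., character-level edit distance divided by the number of reference characters; the metric choice and the function names `edit_distance` and `cer` are illustrative assumptions rather than the paper's evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # r is deleted (missing from the hypothesis)
                curr[j - 1] + 1,         # h is an inserted character
                prev[j - 1] + (r != h),  # substitution, or a match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level CER: total character edits divided by total reference characters."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total = sum(len(r) for r in refs)
    if total == 0:
        raise ValueError("references contain no characters")
    return sum(edit_distance(r, h) for r, h in zip(refs, hyps)) / total


if __name__ == "__main__":
    # A hypothesis that drops the repeated 我 is penalized against a verbatim reference.
    print(edit_distance("我我想打电话", "我想打电话"))  # 1
    print(cer(["我我想打电话", "打开空调"], ["我想打电话", "打开空调"]))  # 0.1
```

Whether stuttering events are kept in the reference or normalized away before scoring changes what the metric rewards; this example keeps them, so omitting the repetition counts as one deletion.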

Paper Structure (11 sections, 5 tables)