
Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale Speech Recognition

Jeehyun Lee, Yerin Choi, Tae-Jin Song, Myoung-Wan Koo

TL;DR

The paper addresses automatic detection of inappropriate pauses (IPs) in dysarthric speech to aid post-stroke rehabilitation. It reframes IP detection as an end-to-end ASR task: pauses are tagged in the transcripts, labels are assigned at the text level under clinical guidance, and an IP prediction layer is added to a Whisper-based Seq2Seq model. The authors introduce a data labeling scheme with <SIL> markers, a three-way IP annotation (0/1/2), and a task-specific evaluation metric, and they report better pause detection (PauER) and IP detection (IPER) than baselines while maintaining ASR performance. The approach is presented as scalable and robust across dysarthria severity, with potential for language-agnostic clinical deployment to support therapy and diagnosis. The reported PauER and IPER results suggest practical utility for automated feedback in dysarthria therapy.

Abstract

Dysarthria, a common issue among stroke patients, severely impacts speech intelligibility. Inappropriate pauses are crucial indicators in severity assessment and speech-language therapy. We propose to extend a large-scale speech recognition model for inappropriate pause detection in dysarthric speech. To this end, we propose a task design, a labeling strategy, and a speech recognition model with an inappropriate pause prediction layer. First, we treat pause detection as speech recognition, using an automatic speech recognition (ASR) model to convert speech into text with pause tags. Following the newly designed task, we label pause locations and their appropriateness at the text level. We collaborate with speech-language pathologists to establish labeling criteria, ensuring high-quality annotated data. Finally, we extend the ASR model with an inappropriate pause prediction layer for end-to-end inappropriate pause detection. Moreover, we propose a task-tailored metric for evaluating inappropriate pause detection independently of ASR performance. Our experiments show that the proposed method detects inappropriate pauses in dysarthric speech better than baselines, with an Inappropriate Pause Error Rate of 14.47%.
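
To make the text-level labeling idea concrete, here is a minimal Python sketch of how a transcript with pause tags might be handled. It assumes whitespace-separated tokens and a literal `<SIL>` tag, as mentioned in the TL;DR; the function names `pause_positions` and `strip_pauses` and the English example sentence are hypothetical and are not taken from the paper or its corpus.

```python
SIL = "<SIL>"  # pause tag as named in the summary; tokenization below is an assumption


def pause_positions(tagged_transcript: str) -> list[int]:
    """For each pause tag, return the number of words that precede it."""
    positions = []
    words_seen = 0
    for token in tagged_transcript.split():
        if token == SIL:
            positions.append(words_seen)
        else:
            words_seen += 1
    return positions


def strip_pauses(tagged_transcript: str) -> str:
    """Recover the plain transcript by dropping all pause tags."""
    return " ".join(t for t in tagged_transcript.split() if t != SIL)


example = "the weather <SIL> is nice <SIL> today"
print(pause_positions(example))  # [2, 4]
print(strip_pauses(example))     # the weather is nice today
```

Keeping pauses as ordinary tokens in the target text is what lets a sequence-to-sequence ASR model emit them alongside words, so no separate frame-level alignment step is needed in this sketch.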

Paper Structure

This paper contains 11 sections, 2 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Examples of TIMIT-style labeling (left) and the proposed labeling (right). The samples are from the Korean dysarthric speech corpus.
  • Figure 2: The proposed inappropriate pause detection model architecture. Above the Whisper decoder layers are two task-specific layers: the Inappropriate Pause Prediction layer (IP Prediction layer) and the Transcript Prediction layer.
  • Figure 3: Example pause sequences used for metric calculation. The top sequence tests how well the model detects pauses; the bottom sequence tests how well it detects inappropriate pauses.
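
The pause sequences in Figure 3 suggest scoring pauses separately from word recognition. The Python sketch below shows one plausible way to do that, and it is not the paper's definition of PauER or IPER, which this section does not give. It assumes every non-pause token can be collapsed to a placeholder `W` and that the error rate is the edit distance between the reduced sequences divided by the reduced reference length. The names `reduce_to_pause_sequence` and `pause_error_rate` are hypothetical.

```python
def levenshtein(ref: list[str], hyp: list[str]) -> int:
    """Standard edit distance (substitutions, insertions, deletions cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def reduce_to_pause_sequence(tagged: str, pause_tag: str = "<SIL>") -> list[str]:
    """Keep pause tags and collapse every other token to 'W' (an assumption)."""
    return [t if t == pause_tag else "W" for t in tagged.split()]


def pause_error_rate(reference: str, hypothesis: str) -> float:
    """Hypothetical pause-level error rate over the reduced sequences."""
    ref = reduce_to_pause_sequence(reference)
    hyp = reduce_to_pause_sequence(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return levenshtein(ref, hyp) / len(ref)


reference = "the weather <SIL> is nice today"
# Misrecognized word, correct pause position: no pause error in this sketch.
print(pause_error_rate(reference, "the whether <SIL> is nice today"))
# Pause placed one word too late: counted as an error.
print(pause_error_rate(reference, "the weather is <SIL> nice today"))
```

In this sketch a word substitution leaves the score unchanged, whereas inserted or deleted words still shift it, so it only approximates a metric that is independent of ASR performance.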