Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection

Chenxu Guo, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Shuhe Li, Zongli Ye, Hwi Joo Park, Anaisha Das, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Gorno-Tempini, Gopala Anumanchipalli

TL;DR

Dysfluency transcription and detection remain challenging when relying on binary classification or text-independent models. The authors propose Dysfluent-WFST, a zero-shot WFST-based decoder that jointly transcribes phonemes and detects dysfluencies using encoder emissions without additional training, compatible with models like WavLM. The approach leverages pronunciation priors through dynamic weighting and a dysfluency-aware decoding graph to achieve state-of-the-art phonetic error rate and dysfluency detection on simulated and real nfvPPA data, with strong interpretability and efficiency. While effective for repetitions, it shows limitations for insertions/deletions and remains non-differentiable, prompting future work on joint training with encoders and incorporating articulatory feedback to further enhance performance and robustness.

Abstract

Automatic detection of speech dysfluency aids speech-language pathologists in efficient transcription of disordered speech, enhancing diagnostics and treatment planning. Traditional methods, often limited to classification, provide insufficient clinical insight, and text-independent models misclassify dysfluency, especially in context-dependent cases. This work introduces Dysfluent-WFST, a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. Unlike previous models, Dysfluent-WFST operates with upstream encoders like WavLM and requires no additional training. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data. Our approach is lightweight, interpretable, and effective, demonstrating that explicit modeling of pronunciation behavior in decoding, rather than complex architectures, is key to improving dysfluency processing systems.

Paper Structure

This paper contains 11 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Illustrating Dysfluency Transitions: Repetition, Deletion, and Insertion in WFST
  • Figure 2: Decoder Workflow Based on WFST: The framework takes as input the speech signal and the corresponding reference text, and outputs the phoneme transcription sequence along with dysfluency detection (e.g., repetition). In this example, the reference text is "She's not here," while the spoken audio is "She's n-not (N AA N AA T) here." When using greedy search, the model may incorrectly force-align the repetition part to non-monotonic phonemes, such as "D." In contrast, our WFST-based method incorporates a return arc, enabling the repetition of phonemes present in the reference text. As the shortest path traverses this return arc, the output is labeled as "5<trans>3," indicating a repetition in speech. Consequently, our WFST-based decoder successfully outputs the correct phoneme transcription along with accurate dysfluency detection.
  • Figure 3: Impact of varying $\beta$ values on transcription performance across different datasets
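
The return-arc idea in the Figure 2 caption can be illustrated with a small toy decoder. The sketch below is not the authors' implementation and uses no WFST library: it runs a plain Viterbi search over reference-phoneme positions, allowing self-loops, forward arcs, and penalized backward ("return") arcs. The function name `decode_with_return_arcs`, the helper `one_hot`, the `return_penalty` and `mismatch` values, and the 0-based state indices in the `<trans>` label are all assumptions made for illustration. Deletions and insertions are deliberately left out.

```python
import math


def one_hot(phone):
    """Hypothetical toy emission: one frame that strongly favors `phone`."""
    return {phone: math.log(0.9)}


def decode_with_return_arcs(ref, frames, return_penalty=2.0, mismatch=-5.0):
    """Viterbi over reference positions with self-loop, forward, and return arcs.

    ref    : list of reference phonemes, e.g. ["N", "AA", "T"]
    frames : list of dicts mapping phoneme -> log-probability (toy emissions)
    Returns (phoneme sequence, list of "src<trans>dst" repetition labels).
    """
    n = len(ref)
    neg_inf = -math.inf
    # score[s] = best log-score of any path that is in state s at the current frame
    score = [neg_inf] * n
    score[0] = frames[0].get(ref[0], mismatch)  # paths must start at the first phoneme
    backptrs = []
    for frame in frames[1:]:
        new_score = [neg_inf] * n
        ptr = [0] * n
        for s in range(n):
            emit = frame.get(ref[s], mismatch)
            for p in range(n):
                if score[p] == neg_inf:
                    continue
                if p == s or p + 1 == s:
                    trans = 0.0               # self-loop or forward arc
                elif p > s:
                    trans = -return_penalty   # return arc: jump back in the reference
                else:
                    continue                  # forward skips (deletions) are not modeled
                cand = score[p] + trans + emit
                if cand > new_score[s]:
                    new_score[s], ptr[s] = cand, p
        backptrs.append(ptr)
        score = new_score
    if score[n - 1] == neg_inf:
        raise ValueError("no path reaches the final reference phoneme")
    # Backtrace from the final reference phoneme.
    s = n - 1
    path = [s]
    for ptr in reversed(backptrs):
        s = ptr[s]
        path.append(s)
    path.reverse()
    # Collapse self-loops; emit a label whenever a return arc was taken.
    phones, labels = [ref[path[0]]], []
    for prev, cur in zip(path, path[1:]):
        if cur == prev:
            continue
        if cur < prev:
            labels.append(f"{prev}<trans>{cur}")
        phones.append(ref[cur])
    return phones, labels


# Reference "not" = N AA T; spoken "n-not" = N AA N AA T (one frame per phoneme).
spoken = [one_hot(p) for p in ["N", "AA", "N", "AA", "T"]]
print(decode_with_return_arcs(["N", "AA", "T"], spoken))
# → (['N', 'AA', 'N', 'AA', 'T'], ['1<trans>0'])
```

Because every state corresponds to a phoneme in the reference, the decoder cannot drift to an unrelated phoneme such as "D". The only way to explain the repeated "N AA" is to pay the return-arc penalty, and that traversal is what produces the repetition label.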