Towards Hierarchical Spoken Language Dysfluency Modeling

Jiachen Lian, Gopala Anumanchipalli

TL;DR

The paper addresses the bottleneck in disfluency modeling by proposing Hierarchical Unconstrained Disfluency Modeling (H-UDM), a two-module framework that couples a Transcription Module (URFA with 2D-Alignment and Text Refresher) and a Detection Module based on template matching. It introduces monotonicity via CTC and recursive inference to improve phonetic and word-level transcription and disfluency detection, evaluated on disordered and aphasia-like speech with datasets such as Buckeye and VCTK/VCTK++. The results show substantial gains over baselines, including improved dPER, iWER, and F1/Matching Score metrics, highlighting the practical impact for speech therapy and language learning. While promising, the work also notes limitations in handling disordered speech and open-domain disfluencies, pointing to future work on end-to-end models and alternative speech units to broaden applicability.

Abstract

Speech disfluency modeling is the bottleneck for both speech therapy and language learning. However, there is no effective AI solution to systematically tackle this problem. We solidify the concept of disfluent speech and disfluent speech modeling. We then present the Hierarchical Unconstrained Disfluency Modeling (H-UDM) approach, the hierarchical extension of UDM that addresses both disfluency transcription and detection to eliminate the need for extensive manual annotation. Our experimental findings serve as clear evidence of the effectiveness and reliability of the methods we have introduced, encompassing both transcription and detection tasks.

Paper Structure (23 sections, 9 figures, 5 tables)

This paper contains 23 sections, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Hierarchical Unconstrained Disfluency Modeling (H-UDM) consists of a Transcription module and a Detection module. Both word-level and phoneme-level disfluencies are detected and localized. Here is an example of aphasia speech. The reference text is "You wish to know all about my grandfather," while the real/human transcription differs significantly from the reference. Whisper (Radford et al., 2022) recognizes it as perfect speech, while H-UDM is able to capture most of the disfluency patterns. An audio sample of this can be found here.
  • Figure 2: The Unconstrained Recursive Forced Aligner (URFA) consists of three basic modules: UFA, 2D-Alignment Search, and Smoothed Re-segmentation. In the first iteration (zero-order), the entire utterance is taken and a 2D alignment is generated. Starting at the second iteration (first-order), the disfluent speech is segmented at the word level, and each segment is processed separately; the results are then combined to generate the final 2D alignment for detection.
  • Figure 3: 2D-Alignment Modeling
  • Figure 4: Scaling law for ASR under various conditions: (i) Perfect ASR (p-ASR); (ii) Imperfect ASR (i-ASR); (iii) Overall ASR (o-ASR).
  • Figure 5: Segmentation (dyslexia sample: "Giving those who observe him")
  • ...and 4 more figures