Towards Hierarchical Spoken Language Dysfluency Modeling
Jiachen Lian, Gopala Anumanchipalli
TL;DR
The paper addresses the bottleneck in disfluency modeling by proposing Hierarchical Unconstrained Disfluency Modeling (H-UDM), a two-module framework that couples a Transcription Module (URFA with 2D-Alignment and Text Refresher) and a Detection Module based on template matching. It introduces monotonicity via CTC and recursive inference to improve phonetic and word-level transcription and disfluency detection, evaluated on disordered and aphasia-like speech with datasets such as Buckeye and VCTK/VCTK++. The results show substantial gains over baselines, including improved dPER, iWER, and F1/Matching Score metrics, highlighting the practical impact for speech therapy and language learning. While promising, the work also notes limitations in handling disordered speech and open-domain disfluencies, pointing to future work on end-to-end models and alternative speech units to broaden applicability.
Abstract
Speech disfluency modeling is the bottleneck for both speech therapy and language learning. However, there is no effective AI solution to systematically tackle this problem. We solidify the concept of disfluent speech and disfluent speech modeling. We then present the Hierarchical Unconstrained Disfluency Modeling (H-UDM) approach, the hierarchical extension of UDM that addresses both disfluency transcription and detection to eliminate the need for extensive manual annotation. Our experimental findings serve as clear evidence of the effectiveness and reliability of the methods we have introduced, encompassing both transcription and detection tasks.
