Stutter-Solver: End-to-end Multi-lingual Dysfluency Detection

Xuanru Zhou; Cheol Jun Cho; Ayati Sharma; Brittany Morin; David Baquirin; Jet Vonk; Zoe Ezzes; Zachary Miller; Boon Lead Tee; Maria Luisa Gorno Tempini; Jiachen Lian; Gopala Anumanchipalli

TL;DR

Stutter-Solver is an end-to-end framework, inspired by the YOLO object detection algorithm, that detects dysfluencies with accurate type and time transcription; it achieves state-of-the-art performance on all available dysfluency corpora.

Abstract

Current de facto dysfluency modeling methods rely on template-matching algorithms, which do not generalize to out-of-domain, real-world dysfluencies across languages and do not scale with increasing amounts of training data. To address these problems, we propose Stutter-Solver: an end-to-end framework that detects dysfluency with accurate type and time transcription, inspired by the YOLO object detection algorithm. Stutter-Solver can handle co-dysfluencies and is a natural multi-lingual dysfluency detector. To leverage scalability and boost performance, we also introduce three novel dysfluency corpora: VCTK-Pro, VCTK-Art, and AISHELL3-Pro, which simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation through articulatory-encodec and TTS-based methods. Our approach achieves state-of-the-art performance on all available dysfluency corpora. Code and datasets are open-sourced at https://github.com/eureka235/Stutter-Solver

Paper Structure (24 sections, 3 figures, 8 tables)

This paper contains 24 sections, 3 figures, 8 tables.

Introduction
Articulatory-based Simulation
Method Pipeline
Multi-Lingual TTS-based Simulation
Method pipeline
Co-Dysfluency TTS rules
Dysfluency Detection as Object Detection
Soft speech-text alignments
Spatial-Temporal Encoders
Spatial Encoder
Temporal Encoder
Training Loss
Experiments
Datasets
Training
...and 9 more sections

Figures (3)

Figure 1: We use the pretrained VITS speech and text encoders to process the spectrogram and the reference text respectively, generating the soft speech-text alignments that are passed into the detector. The output matrix contains an existence confidence score and confidence scores for 5 dysfluency types (start and end bounds are left out of the diagram). Brighter cells mark higher scores, indicating the existence and type of a dysfluency. a) shows the serial structure of our detector: a spatial encoder followed by a temporal encoder. b) is a diagram of a spatial encoder block; grouped convolutions are important for extracting local spatial features without completely collapsing information across the text dimension.
Figure 2: Pipeline of articulatory-based simulation.
Figure 3: TTS rules for VCTK-Pro and AISHELL3-Pro
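
The detector output described in the Figure 1 caption (an existence confidence, 5 type confidences, and start and end bounds) can be read out in the usual YOLO-style way. The following is a minimal sketch of that readout, not the authors' implementation: the row layout `[start, end, exist_conf, 5 type scores]`, the order of the type names, the function name `decode_regions`, and the 0.5 threshold are all assumptions made for illustration.

```python
# Dysfluency types named in the abstract; their column order is an assumption.
DYSFLUENCY_TYPES = ["repetition", "block", "missing", "replacement", "prolongation"]


def decode_regions(output, threshold=0.5):
    """Turn per-region detector rows into (type, start, end, confidence) tuples.

    Each row is assumed (hypothetically) to be laid out as
    [start, end, exist_conf, type_score_0, ..., type_score_4].
    """
    detections = []
    for row in output:
        start, end, exist_conf = row[0], row[1], row[2]
        type_scores = row[3:3 + len(DYSFLUENCY_TYPES)]
        # Skip regions the detector does not consider dysfluent.
        if exist_conf < threshold:
            continue
        # Pick the dysfluency type with the highest confidence score.
        best = max(range(len(type_scores)), key=lambda i: type_scores[i])
        detections.append((DYSFLUENCY_TYPES[best], start, end, exist_conf))
    return detections


example_output = [
    [0.10, 0.40, 0.90, 0.10, 0.70, 0.05, 0.10, 0.05],  # confident region
    [0.50, 0.60, 0.20, 0.90, 0.00, 0.00, 0.00, 0.00],  # below threshold
]
detections = decode_regions(example_output)
```

Because every region is decoded independently, several regions can pass the threshold for one utterance, which is one simple way a co-dysfluency could surface in such a readout.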
