Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection

Xuanru Zhou; Jiachen Lian; Cheol Jun Cho; Jingwen Liu; Zongli Ye; Jinming Zhang; Brittany Morin; David Baquirin; Jet Vonk; Zoe Ezzes; Zachary Miller; Maria Luisa Gorno Tempini; Gopala Anumanchipalli

TL;DR

This work proposes rule-based speech and text dysfluency simulators, develops VCTK-token, and builds a Whisper-like seq2seq architecture that sets a new benchmark with decent performance. It also proposes a unified benchmark to facilitate future research.

Abstract

Speech dysfluency modeling is the task of detecting dysfluencies in speech, such as repetition, block, insertion, replacement, and deletion. Most recent advances treat it as a time-based object detection problem. In this work, we revisit the problem from a new perspective: tokenizing dysfluencies and modeling detection as a token-based automatic speech recognition (ASR) problem. We propose rule-based speech and text dysfluency simulators, develop VCTK-token, and then develop a Whisper-like seq2seq architecture to build a new benchmark with decent performance. We also systematically compare our proposed token-based methods with time-based methods, and propose a unified benchmark to facilitate future research. We open-source these resources for the broader scientific community. The project page is available at https://rorizzz.github.io/
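To make the contrast between the two formulations concrete, here is a minimal sketch of what each kind of output might look like. Everything in it is an illustrative assumption: the token spellings (`[REP]`, `[BLOCK]`, and so on), the class names, and the convention that a token's position is the number of plain words preceding it are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical token spellings for the five dysfluency types named in the
# abstract; the paper's actual token inventory may differ.
DYSFLUENCY_TOKENS = {
    "[REP]": "repetition",
    "[BLOCK]": "block",
    "[INS]": "insertion",
    "[RPL]": "replacement",
    "[DEL]": "deletion",
}


@dataclass
class TimeDetection:
    """Time-based target: a dysfluency type plus a region in the audio."""
    kind: str
    start_s: float
    end_s: float


@dataclass
class TokenDetection:
    """Token-based target: a dysfluency type plus a position in the text."""
    kind: str
    word_index: int  # number of plain words seen before the token


def parse_token_output(transcript):
    """Split an ASR-style transcript into plain words and dysfluency tokens."""
    words = []
    detections = []
    for piece in transcript.split():
        kind = DYSFLUENCY_TOKENS.get(piece)
        if kind is None:
            words.append(piece)
        else:
            detections.append(TokenDetection(kind, len(words)))
    return words, detections


words, detections = parse_token_output("please call [REP] call stella")
print(words)
print(detections)
```

Under this assumed convention, a time-based detector would emit something like `TimeDetection("repetition", 0.8, 1.4)` for the same utterance, while the token-based model only has to emit the annotated transcript.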

Paper Structure (21 sections, 2 figures, 4 tables)

Introduction
Time-based Dysfluency Detection
YOLO-Stutter
YOLO-Stutter-LCS
Token-based Dysfluency Detection
Dysfluency Simulation
Text Simulator
Speech Simulator
Speech Dysfluency Detector
Feature Extractor
Dysfluent Text Tokenizer
Whisper Detector
Experiments
Datasets
Metrics
...and 6 more sections

Figures (2)

Figure 1: Comparison of Time-based and Token-based Methods
Figure 2: We begin by using a text simulator to convert the reference text into annotated text, at either the word or the phoneme level. Next, we generate the corresponding dysfluent speech with a speech simulator. The Whisper feature extractor processes the resulting audio waveform, while the Whisper Tokenizer or the CMU Phoneme Tokenizer handles the word-level or phoneme-level annotated text. These audio and text representations are then fed into a Whisper-like encoder-decoder architecture for training and prediction. (a) shows the entire pipeline for token-based dysfluency detection. (b) illustrates the rules for injecting tokenized dysfluencies into the text space. Audio samples of dysfluent speech: https://bit.ly/3XI5CTu
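The text-simulator step in the caption can be sketched roughly as follows. This is a word-level illustration under stated assumptions, not the paper's actual rule set: the token spellings, the placement of each token relative to the affected word, and the `filler` and `substitute` parameters are all hypothetical choices made for the example.

```python
import random

# Hypothetical token spellings; the paper's real annotation scheme may differ.
TOKENS = {
    "repetition": "[REP]",
    "block": "[BLOCK]",
    "insertion": "[INS]",
    "replacement": "[RPL]",
    "deletion": "[DEL]",
}


def inject_dysfluency(words, kind, index, filler="uh", substitute="thing"):
    """Return a copy of `words` with one annotated dysfluency at `index`."""
    if kind not in TOKENS:
        raise ValueError(f"unknown dysfluency type: {kind}")
    if not 0 <= index < len(words):
        raise IndexError("index out of range")
    token = TOKENS[kind]
    out = list(words)
    if kind == "repetition":
        # Say the word twice, with the token between the two copies.
        out[index:index + 1] = [words[index], token, words[index]]
    elif kind == "block":
        # Mark a pause immediately before the word.
        out.insert(index, token)
    elif kind == "insertion":
        # Add a filler word, followed by its token, before the word.
        out[index:index] = [filler, token]
    elif kind == "replacement":
        # Swap the word for a substitute and mark the swap.
        out[index:index + 1] = [substitute, token]
    else:
        # Deletion: drop the word and leave only the token.
        out[index:index + 1] = [token]
    return out


def simulate(text, rng=None):
    """Inject one randomly chosen dysfluency into a reference sentence."""
    words = text.split()
    if not words:
        return text
    rng = rng or random.Random()
    kind = rng.choice(sorted(TOKENS))
    index = rng.randrange(len(words))
    return " ".join(inject_dysfluency(words, kind, index))


print(simulate("please call stella", random.Random(0)))
```

In a full pipeline, the annotated text produced this way would serve as the decoder target, and a speech simulator would render the matching dysfluent audio as the encoder input.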
