Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection

Xuanru Zhou, Jiachen Lian, Cheol Jun Cho, Jingwen Liu, Zongli Ye, Jinming Zhang, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Luisa Gorno Tempini, Gopala Anumanchipalli

TL;DR

This work proposes rule-based speech and text dysfluency simulators, develops VCTK-token, builds a new benchmark with decent performance using a Whisper-like seq2seq architecture, and proposes a unified benchmark to facilitate future research.

Abstract

Speech dysfluency modeling is a task to detect dysfluencies in speech, such as repetition, block, insertion, replacement, and deletion. Most recent advancements treat this problem as a time-based object detection problem. In this work, we revisit this problem from a new perspective: tokenizing dysfluencies and modeling the detection problem as a token-based automatic speech recognition (ASR) problem. We propose rule-based speech and text dysfluency simulators and develop VCTK-token, and then develop a Whisper-like seq2seq architecture to build a new benchmark with decent performance. We also systematically compare our proposed token-based methods with time-based methods, and propose a unified benchmark to facilitate future research endeavors. We open-source these resources for the broader scientific community. The project page is available at https://rorizzz.github.io/

Paper Structure (21 sections, 2 figures, 4 tables)

This paper contains 21 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Comparison of Time-based and Token-based Methods
  • Figure 2: a) The entire pipeline for token-based dysfluency detection. We begin by using a text simulator to convert the reference text into annotated text, at either the word or the phoneme level. Next, we generate the corresponding dysfluent speech through a speech simulator. The Whisper feature extractor processes the resulting audio waveform, while the Whisper Tokenizer or the CMU Phoneme Tokenizer handles the word-level or phoneme-level annotated text. These audio and text representations are then fed into a Whisper-like encoder-decoder architecture for training and prediction. b) The rules for injecting dysfluency tokens into the text space. Audio samples of dysfluent speech are available at https://bit.ly/3XI5CTu
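
The text-simulation step described for Figure 2 can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the token strings (`[REP]`, `[BLOCK]`, `[INS]`, `[SUB]`, `[DEL]`), the function name `inject_dysfluency`, the filler word, and the placement rules are hypothetical and are not taken from the paper, whose actual injection rules are given in panel b). The sketch only shows the general idea of turning a reference word sequence into annotated text that a seq2seq model could be trained to emit.

```python
from typing import List

# Hypothetical token vocabulary, one token per dysfluency type named in the
# abstract. The paper's real token strings may differ.
TOKENS = {
    "repetition": "[REP]",
    "block": "[BLOCK]",
    "insertion": "[INS]",
    "replacement": "[SUB]",
    "deletion": "[DEL]",
}


def inject_dysfluency(
    words: List[str], index: int, kind: str, filler: str = "uh"
) -> List[str]:
    """Return a copy of `words` with one dysfluency of type `kind` at `index`.

    The placement rules below are illustrative assumptions, not the
    authors' rules.
    """
    if kind not in TOKENS:
        raise ValueError(f"unknown dysfluency type: {kind}")
    if not 0 <= index < len(words):
        raise IndexError("index is outside the reference text")

    token = TOKENS[kind]
    before, target, after = words[:index], words[index], words[index + 1:]

    if kind == "repetition":
        middle = [target, token, target]   # word is spoken twice
    elif kind == "block":
        middle = [token, target]           # pause before the word
    elif kind == "insertion":
        middle = [filler, token, target]   # extra word before the target
    elif kind == "replacement":
        middle = [filler, token]           # target swapped for another word
    else:  # deletion
        middle = [token]                   # target dropped, token marks the gap
    return before + middle + after


reference = "please call stella".split()
annotated = " ".join(inject_dysfluency(reference, 1, "repetition"))
print(annotated)
```

In a token-based setup, a string such as `annotated` would serve as the decoder target paired with the simulated dysfluent audio, so detection reduces to predicting where the special tokens appear in the transcript.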