SSDM: Scalable Speech Dysfluency Modeling

Jiachen Lian; Xuanru Zhou; Zoe Ezzes; Jet Vonk; Brittany Morin; David Baquirin; Zachary Mille; Maria Luisa Gorno Tempini; Gopala Krishna Anumanchipalli

TL;DR

SSDM adopts articulatory gestures as scalable forced alignment; introduces a connectionist subsequence aligner (CSA) to achieve dysfluency alignment; introduces Libri-Dys, a large-scale simulated dysfluency corpus; and develops an end-to-end system that leverages large language models (LLMs).

Abstract

Speech dysfluency modeling is the core module for spoken language learning and speech therapy. However, three challenges remain. First, current state-of-the-art solutions \cite{lian2023unconstrained-udm, lian-anumanchipalli-2024-towards-hudm} suffer from poor scalability. Second, a large-scale dysfluency corpus is lacking. Third, there is no effective learning framework. In this paper, we propose SSDM: Scalable Speech Dysfluency Modeling, which (1) adopts articulatory gestures as scalable forced alignment; (2) introduces a connectionist subsequence aligner (CSA) to achieve dysfluency alignment; (3) introduces a large-scale simulated dysfluency corpus called Libri-Dys; and (4) develops an end-to-end system by leveraging the power of large language models (LLMs). We expect SSDM to serve as a standard in the area of dysfluency modeling. A demo is available at https://berkeley-speech-group.github.io/SSDM/.

Paper Structure

This paper contains 70 sections, 16 equations, 15 figures, 7 tables, and 2 algorithms.

Introduction
Articulatory Gesture is Scalable Forced Aligner
Background
Revisit Speech Representation Learning
Gestural Modeling
Scalable Dysfluent Phonetic Forced Aligner
Neural Variational Gestural Modeling
Universal Acoustic to Articulatory Inversion (UAAI)
Gestural Variational Autoencoders
Variational Inference
VAE Objective
Duration Posterior $q_{\phi}(D^{k,i}|Z^{k,i}, X^{k,i})$
Intensity Posterior $q_{\phi}(I^{k,i}|Z^{k,i}, X^{k,i})$
Online Sparse Sampling
Multi-scale Gestural Decoder
...and 55 more sections

Figures (15)

Figure 1: SSDM. Comparison to other methods
Figure 2: SSDM architecture
Figure 3: LSA(LCS) delivers dysfluent alignment that is more semantically aligned.
Figure 4: CSA
Figure 5: Gestural Dysfluency Visualization
...and 10 more figures
