On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts

Kashaf Gulzar, Dominik Wagner, Sebastian P. Bayerl, Florian Hönig, Tobias Bocklet, Korbinian Riedhammer

TL;DR

This work tackles the challenge of dysfluency and fluency shaping in multilingual end-to-end ASR by introducing token-level markers (<d> for dysfluencies and <m> for modified speech) and employing parameter-efficient LoRA fine-tuning on the Whisper model. It demonstrates that token-level modeling can improve token placement and, to a degree, transcription accuracy on English (LSS) and German (KSoF) data, with multilingual adaptation via VoxPopuli German further boosting performance. A tokenization bias analysis reveals that English-centric BPE tokenization causes over-segmentation in German, limiting gains on KSoF and highlighting the need for tokenizer-aware multilingual approaches. Overall, the study provides a viable path for dysfluency-aware ASR while identifying tokenizer design as a critical barrier for clinical multilingual applications.
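As a rough illustration of what token-level markers look like in practice, the sketch below separates the `<d>` and `<m>` markers from the plain words of a transcript, so that word accuracy and marker placement could be scored separately. The marker strings come from the summary above; the example transcript, the whitespace-separated layout, and the helper name `split_markers` are assumptions made for illustration, not the authors' format or code.

```python
# Marker strings as named in the summary; their exact placement
# convention inside a transcript is an assumption of this sketch.
MARKERS = ("<d>", "<m>")


def split_markers(transcript: str):
    """Separate plain words from marker tokens.

    Returns (words, markers). Each entry of markers is a pair
    (marker, index), where index is the number of plain words
    that precede the marker in the transcript.
    """
    words, markers = [], []
    for tok in transcript.split():
        if tok in MARKERS:
            markers.append((tok, len(words)))
        else:
            words.append(tok)
    return words, markers


# Hypothetical annotated transcript (not taken from LSS or KSoF).
example = "i <d> i want to <m> go home"
words, markers = split_markers(example)
```

With the words and marker positions separated, a reference and a hypothesis can be compared once on the plain word sequence and once on where the markers fall.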

Abstract

Automatic transcription of stuttered speech remains a challenge, even for modern end-to-end (E2E) automatic speech recognition (ASR) frameworks. Dysfluencies and fluency-shaping artifacts are often overlooked, resulting in non-verbatim transcriptions with limited clinical and research value. We propose a parameter-efficient adaptation method to decode dysfluencies and fluency modifications as special tokens within transcriptions, evaluated on simulated (LibriStutter, English) and natural (KSoF, German) stuttered speech datasets. To mitigate ASR performance disparities and bias towards English, we introduce a multi-step fine-tuning strategy with language-adaptive pretraining. Tokenization analysis further highlights the tokenizer's English-centric bias, which poses challenges for improving performance on German data. Our findings demonstrate the effectiveness of lightweight adaptation techniques for dysfluency-aware ASR while exposing key limitations in multilingual E2E systems.
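One common way to quantify the over-segmentation mentioned above is fertility, the average number of subword tokens per word. The sketch below computes it with a toy greedy longest-match segmenter and a small hypothetical vocabulary that happens to favor English. This is a stand-in for illustration only: Whisper uses a byte-level BPE tokenizer, and the vocabulary, word lists, and function names here are assumptions, not the analysis performed in the paper.

```python
def greedy_subwords(word, vocab):
    """Greedy longest-match segmentation of a single word.

    Falls back to single characters when no longer piece is in vocab.
    """
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first; a single character always matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def fertility(words, vocab):
    """Average number of subword tokens per word."""
    total = sum(len(greedy_subwords(w, vocab)) for w in words)
    return total / len(words)


# Hypothetical English-leaning vocabulary and toy word lists.
TOY_VOCAB = {"the", "speech", "is", "fluent"}
english = ["the", "speech", "is", "fluent"]
german = ["die", "sprache", "ist", "fliessend"]

fertility_en = fertility(english, TOY_VOCAB)
fertility_de = fertility(german, TOY_VOCAB)
```

In this toy setup the German words break into many more pieces than the English ones, which is the kind of gap a tokenization analysis would look for. Longer token sequences per word give the decoder more chances to err and leave less context per token.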

Paper Structure

This paper contains 9 sections, 3 equations, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: System architecture for parameter-efficient fine-tuning of Whisper using LoRA adapters. Numbered markers indicate experimental variants: (1) LoRA fine-tuning for predicting <d> and <m> tokens; (2) the same, applied to a VoxPopuli-adapted model; (3) simultaneous fine-tuning for <d> prediction and <m> classification with a combined loss; (4) sequential fine-tuning for <d> prediction followed by <m> classification, each with separate losses.
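The caption for variant (3) mentions a combined loss but does not give its form. A common choice for joint objectives is a weighted sum of a token-prediction loss and a classification loss, and the sketch below shows that pattern in plain Python. The weighted-sum form, the `weight` parameter, and the function names are all assumptions; the paper's actual formulation may differ.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class under probs."""
    return -math.log(probs[target])


def combined_loss(token_probs, token_target, cls_probs, cls_target, weight=0.5):
    """Weighted sum of a token loss and a classification loss.

    token_probs / token_target: predicted distribution and gold index
        for one decoded token (e.g. whether <d> is emitted).
    cls_probs / cls_target: predicted distribution and gold label for
        an utterance-level class (e.g. modified vs. unmodified speech).
    weight: hypothetical trade-off factor, not taken from the paper.
    """
    l_tok = cross_entropy(token_probs, token_target)
    l_cls = cross_entropy(cls_probs, cls_target)
    return l_tok + weight * l_cls


# Toy values: uncertain token prediction, fairly confident classifier.
loss = combined_loss([0.5, 0.5], 0, [0.9, 0.1], 0, weight=0.5)
```

Under this reading, the sequential variant (4) would instead optimize the two terms in separate stages rather than adding them together.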