Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection

Shangkun Huang; Jing Deng; Jintao Kang; Rong Zheng

TL;DR

This work targets accurate ASR for stuttered Mandarin speech with an LLM-driven multi-task framework that jointly optimizes Automatic Speech Recognition (ASR) and Stuttering Event Detection (SED). A dynamic interaction mechanism feeds CTC-generated hypotheses to the LLM to suppress stuttering-induced hallucinations, while the SED branch supplies stuttering embeddings that improve the model's handling of disfluent speech. Both are fused as multi-modal embeddings into a LoRA-tuned Qwen2.5-3B-Instruct, trained with a hybrid loss that combines $\mathcal{L}_{Focal}$ and $\mathcal{L}_{SupCon}$ into a total loss $\mathcal{L}_{total}$. On AS-70, the approach reaches a CER of $5.45\%$ and an SED F1-score of $73.63\%$, improving on the baselines and demonstrating a unified pipeline for disfluency-aware transcription, with potential applications in speech rehabilitation and possible cross-lingual extensions. Because the framework is optimized end to end across speech and text modalities, it offers a foundation for disfluency processing in clinical and multilingual settings.
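To make the hybrid loss more concrete, the following is a minimal pure-Python sketch of the two named ingredients in their standard textbook forms: a binary focal loss and a supervised contrastive (SupCon) loss. It is not the authors' implementation. The function names, the default values of `gamma`, `alpha`, and `temperature`, and the weighted sum in `total_loss` (including the presence of a separate ASR term and the weights `w_focal` and `w_supcon`) are assumptions made for illustration, since this summary does not specify them.

```python
import math


def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Mean binary focal loss over (probability, 0/1 target) pairs.

    gamma and alpha are illustrative defaults, not values from the paper.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, 1e-7), 1.0 - 1e-7)      # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p          # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t)^gamma down-weights examples that are already easy.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def _cosine(u, v):
    """Cosine similarity with a small epsilon against zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss, averaged over anchors with a positive.

    temperature is an illustrative default, not a value from the paper.
    """
    n = len(embeddings)
    losses = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without a same-label partner are skipped
        logits = {
            j: _cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n) if j != i
        }
        m = max(logits.values())  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        log_probs = [logits[j] - log_denom for j in positives]
        losses.append(-sum(log_probs) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


def total_loss(asr_loss, focal, supcon, w_focal=1.0, w_supcon=1.0):
    """Hypothetical weighted sum; the paper's actual weighting is not given here."""
    return asr_loss + w_focal * focal + w_supcon * supcon
```

In this sketch the focal term shrinks the contribution of confidently classified frames, which is the usual motivation for using it on long-tailed event classes, while the SupCon term rewards embeddings of the same event type for lying close together.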

Abstract

The performance bottleneck of Automatic Speech Recognition (ASR) on stuttered speech has limited its applicability in domains such as speech rehabilitation. This paper proposes an LLM-driven ASR-SED multi-task learning framework that jointly optimizes the ASR and Stuttering Event Detection (SED) tasks. We propose a dynamic interaction mechanism in which the ASR branch uses CTC-generated soft prompts to assist LLM context modeling, while the SED branch outputs stutter embeddings to enhance the LLM's comprehension of stuttered speech. We incorporate contrastive learning to strengthen the discriminative power of stuttering acoustic features and apply Focal Loss to mitigate the long-tailed distribution of stuttering event categories. Evaluations on the AS-70 Mandarin stuttering dataset show that our framework reduces the ASR character error rate (CER) to 5.45% (a 37.71% relative reduction) and achieves an average SED F1-score of 73.63% (a 46.58% relative improvement).
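The relative figures quoted above can be inverted to see roughly what baseline scores they imply. The short sketch below does that arithmetic, assuming each relative change is measured against a single baseline value as `(new - baseline) / baseline`. The helper name `implied_baseline` is hypothetical, and the resulting baselines are derived here rather than reported in this summary.

```python
def implied_baseline(value, relative_change_pct):
    """Invert value = baseline * (1 + relative_change_pct / 100)."""
    return value / (1.0 + relative_change_pct / 100.0)


# Reported results and relative changes, as stated in the abstract.
baseline_cer = implied_baseline(5.45, -37.71)   # CER fell by 37.71% relative
baseline_f1 = implied_baseline(73.63, 46.58)    # F1 rose by 46.58% relative

print(f"implied baseline CER: {baseline_cer:.2f}%")
print(f"implied baseline SED F1: {baseline_f1:.2f}%")
```

Under that assumption the implied baselines are about 8.75% CER and about 50.23% F1, which gives a sense of scale for the stated improvements.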
