Boosting Disfluency Detection with Large Language Model as Disfluency Generator

Zhenrong Cheng, Jiayan Guo, Hao Sun, Yan Zhang

TL;DR

This work tackles data scarcity in disfluency detection by introducing a lightweight data augmentation pipeline that uses large language models to generate diverse, realistic disfluent sentences through prompt-based generation without fine-tuning. An uncertainty-aware filtering step selects high-quality generated samples to train a small, model-agnostic detector, achieving state-of-the-art performance on Switchboard with only a modest amount of augmented data. The approach demonstrates strong cost-efficiency, matching or surpassing baselines that rely on extensive annotated data or fine-tuned generation models. It also highlights the practical potential for deploying lightweight disfluency detectors in ASR and dialogue systems, with avenues for extending to speaker-aware and paraphrasing tasks.

Abstract

Current disfluency detection methods rely heavily on costly and scarce human-annotated data. To tackle this issue, some approaches employ heuristic or statistical features to generate disfluent sentences, which partially improves detection performance. However, these sentences often deviate from real-life scenarios, which limits the overall gain. In this study, we propose a lightweight data augmentation approach for disfluency detection that uses the generative and semantic understanding capabilities of a large language model (LLM) to generate disfluent sentences as augmentation data. We guide the LLM with specific prompts to generate diverse and more realistic sentences, without fine-tuning the LLM. We then apply an uncertainty-aware data filtering approach to improve the quality of the generated sentences, which are used to train a small detection model. Experiments with the augmented data yielded state-of-the-art results. The results showed that even a small amount of LLM-generated augmented data can significantly improve performance, which makes the approach cost-effective. Our code is available here.

Paper Structure (16 sections, 4 equations, 3 figures, 3 tables, 1 algorithm)

Figures (3)

  • Figure 1: An Overview of Our Approach: For disfluency generation, we begin by providing a description prompt to the LLM and guide its generation of disfluent sentences through a series of generation prompts. Subsequently, we use the trainable detection model, after pretraining it on Switchboard, to select high-quality sentences, which form DisAug. For disfluency detection, DisAug is combined with Switchboard as training data to train the original detection model.
  • Figure 2: F1-score under different confidence thresholds. The values on the line represent the number of generated sentences filtered at that threshold.
  • Figure 3: Results of Disfluency Detection with ELECTRA as a Detector under Different Experimental Setups.
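
The confidence-threshold filtering referred to in Figures 1 and 2 can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not state how uncertainty is measured, so the sketch assumes a detector that outputs one disfluency probability per token and scores a sentence by its least confident token. The names `ScoredSentence`, `sentence_confidence`, `filter_by_confidence`, and `kept_counts` are hypothetical, and the probabilities are made-up illustration values.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class ScoredSentence:
    tokens: List[str]
    # Assumed detector output: P(token is disfluent) for each token.
    disfluent_probs: List[float]


def sentence_confidence(probs: Sequence[float]) -> float:
    """Confidence of the least certain token, where a token's
    confidence is max(p, 1 - p). This scoring rule is an assumption."""
    if not probs:
        return 0.0
    return min(max(p, 1.0 - p) for p in probs)


def filter_by_confidence(
    samples: Sequence[ScoredSentence], threshold: float
) -> List[ScoredSentence]:
    """Keep generated sentences whose confidence reaches the threshold."""
    return [
        s for s in samples
        if sentence_confidence(s.disfluent_probs) >= threshold
    ]


def kept_counts(
    samples: Sequence[ScoredSentence], thresholds: Sequence[float]
) -> Dict[float, int]:
    """Number of sentences kept at each threshold."""
    return {t: len(filter_by_confidence(samples, t)) for t in thresholds}


# Made-up example: the detector is sure about the first sentence
# and unsure about the token "a" in the second.
samples = [
    ScoredSentence(
        ["i", "want", "uh", "i", "need", "a", "flight"],
        [0.97, 0.95, 0.99, 0.02, 0.03, 0.01, 0.02],
    ),
    ScoredSentence(
        ["book", "a", "the", "ticket"],
        [0.05, 0.55, 0.10, 0.04],
    ),
]

counts = kept_counts(samples, [0.5, 0.9, 0.99])
print(counts)
```

Raising the threshold keeps fewer but more reliable generated sentences, which is the trade-off a plot of F1-score against threshold would expose.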