FGCL: Fine-grained Contrastive Learning For Mandarin Stuttering Event Detection

Han Jiang, Wenyu Wang, Yiquan Zhou, Hongwu Ding, Jiacheng Xu, Jihua Zhu

TL;DR

This work tackles Mandarin Stuttering Event Detection (MSED) with Fine-Grained Contrastive Learning (FGCL), a framework that models frame-level stuttering likelihood, identifies easy and confusing frames, and refines the confusing frames' representations with a dedicated stutter contrast loss. The method has three core components: (i) likelihood modeling to obtain frame-wise stuttering probabilities, (ii) confusing/easy frame mining using a cascaded contraction/expansion scheme and top/bottom frame selection, and (iii) a two-part contrast loss that strengthens the separation between stuttered and fluent frames without requiring data augmentation. FGCL is designed as a plug-and-play enhancement to standard baselines (e.g., Conformer). On Mandarin data it improves the official F1 score by about 3.0%, rising to over 5.1% with parameter tuning; it also remains effective on English datasets and with HuBERT-based features. Overall, FGCL advances fine-grained acoustic modeling for stuttering event detection (SED), yielding more discriminative frame embeddings and better detection performance, with practical implications for deploying Mandarin stuttering monitoring systems.
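The top/bottom frame selection in component (ii) can be illustrated with a small sketch. It assumes, since the summary does not spell this out, that "easy" frames are the frames with the highest and lowest predicted stuttering probability in an utterance. The function name `select_easy_frames` and the hyperparameter `k` are hypothetical and not taken from the paper.

```python
def select_easy_frames(probs, k):
    """Return (easy_stutter_idx, easy_fluent_idx) for one utterance.

    probs: frame-wise stuttering probabilities in [0, 1].
    k: number of frames taken from each end (hypothetical hyperparameter).
    Assumption: top-k frames are "easy stuttered", bottom-k are "easy fluent".
    """
    if k < 0 or 2 * k > len(probs):
        raise ValueError("k must satisfy 0 <= 2*k <= len(probs)")
    # Sort frame indices by probability, highest first; ties keep frame order.
    order = sorted(range(len(probs)), key=lambda t: -probs[t])
    easy_stutter = sorted(order[:k])              # most confidently stuttered
    easy_fluent = sorted(order[len(order) - k:])  # most confidently fluent
    return easy_stutter, easy_fluent


# Toy utterance with 8 frames; the stuttering event peaks around frames 3-5.
probs = [0.05, 0.10, 0.45, 0.80, 0.95, 0.90, 0.55, 0.20]
easy_stutter, easy_fluent = select_easy_frames(probs, k=2)
print(easy_stutter, easy_fluent)  # [4, 5] [0, 1]
```

In a real model the same selection would more likely be done per batch with a tensor top-k operation. Plain Python is used here only to keep the sketch dependency-free.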

Abstract

This paper presents the T031 team's approach to the StutteringSpeech Challenge in SLT2024. Mandarin Stuttering Event Detection (MSED) aims to detect instances of stuttering events in Mandarin speech. We propose a detailed acoustic analysis method to improve the accuracy of stutter detection by capturing subtle nuances that previous Stuttering Event Detection (SED) techniques have overlooked. To this end, we introduce the Fine-Grained Contrastive Learning (FGCL) framework for MSED. Specifically, we model the frame-level probabilities of stuttering events and introduce a mining algorithm to identify both easy and confusing frames. Then, we propose a stutter contrast loss to enhance the distinction between stuttered and fluent speech frames, thereby improving the discriminative capability of stuttered feature embeddings. Extensive evaluations on English and Mandarin datasets demonstrate the effectiveness of FGCL, achieving a significant increase of over 5.0% in F1 score on Mandarin data.
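To make the idea of a stutter contrast loss concrete, the following is a minimal sketch, not the paper's formulation. It assumes an InfoNCE-style objective over cosine similarity in which confusing frames act as anchors and the mined easy frames serve as positives and negatives, so that no augmented views are needed. The function names, the temperature `tau`, and the split into two symmetric terms are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def contrast_term(anchors, positives, negatives, tau=0.1):
    """InfoNCE-style term: pull anchors toward positives, away from negatives."""
    if not anchors or not positives or not negatives:
        return 0.0
    total = 0.0
    for a in anchors:
        pos = [math.exp(cosine(a, p) / tau) for p in positives]
        neg = [math.exp(cosine(a, n) / tau) for n in negatives]
        denom = sum(pos) + sum(neg)
        # Average -log(pos_i / denom) over all positives of this anchor.
        total += -sum(math.log(p / denom) for p in pos) / len(pos)
    return total / len(anchors)


def stutter_contrast_loss(conf_stutter, conf_fluent,
                          easy_stutter, easy_fluent, tau=0.1):
    """Hypothetical two-part loss over mined frame embeddings."""
    # Part 1: confusing stuttered frames should resemble easy stuttered frames.
    part1 = contrast_term(conf_stutter, easy_stutter, easy_fluent, tau)
    # Part 2: confusing fluent frames should resemble easy fluent frames.
    part2 = contrast_term(conf_fluent, easy_fluent, easy_stutter, tau)
    return part1 + part2


# Toy 2-D embeddings: stuttered frames lie near (1, 0), fluent near (0, 1).
es = [[1.0, 0.0], [0.9, 0.1]]
ef = [[0.0, 1.0], [0.1, 0.9]]
aligned = stutter_contrast_loss([[0.8, 0.2]], [[0.2, 0.8]], es, ef)
swapped = stutter_contrast_loss([[0.2, 0.8]], [[0.8, 0.2]], es, ef)
print(aligned < swapped)  # True
```

The toy check shows the intended behavior: the loss is small when confusing frames already sit near the easy frames of their own class and large when they sit near the opposite class.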

Paper Structure

This paper contains 20 sections, 14 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: Overview of our FGCL system. Following frame-level likelihood modeling, we mine confusing and easy frames. $\mathcal{L}_{SC}$ aims to refine the confusing embeddings.
  • Figure 2: A toy example of mining the confusing frames. The orange dashed box represents the smaller mask ($m=2$), while the purple dashed box represents the larger mask ($M=4$). We sample confusing frames from the borders of the pseudo-boundaries.
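The toy example in Figure 2 can be approximated in code. The caption does not define the mask operator, so this sketch assumes that the frame-wise probabilities are thresholded into a binary pseudo-label sequence, that contraction and expansion are 1-D binary erosion and dilation, and that $m$ and $M$ are the number of frames removed or added on each side. Confusing frames are then taken from the band between the two mask sizes. The function names and the `threshold` parameter are hypothetical.

```python
def erode(mask, r):
    """Keep frame t only if every frame within r steps is 1 (edges count as 0)."""
    n = len(mask)
    return [
        int(all(0 <= s < n and mask[s] == 1 for s in range(t - r, t + r + 1)))
        for t in range(n)
    ]


def dilate(mask, r):
    """Set frame t to 1 if any in-range frame within r steps is 1."""
    n = len(mask)
    return [
        int(any(mask[s] == 1 for s in range(max(0, t - r), min(n, t + r + 1))))
        for t in range(n)
    ]


def mine_confusing_frames(probs, m=2, M=4, threshold=0.5):
    """Return (inner_idx, outer_idx) of candidate confusing frames.

    inner_idx: frames inside the pseudo-stutter region, near its border.
    outer_idx: frames outside the pseudo-stutter region, near its border.
    """
    if not 0 <= m <= M:
        raise ValueError("expected 0 <= m <= M")
    binary = [int(p > threshold) for p in probs]
    # Contraction: frames kept by the small mask but removed by the large one.
    inner = [a - b for a, b in zip(erode(binary, m), erode(binary, M))]
    # Expansion: frames added by the large mask but not by the small one.
    outer = [a - b for a, b in zip(dilate(binary, M), dilate(binary, m))]
    inner_idx = [t for t, v in enumerate(inner) if v == 1]
    outer_idx = [t for t, v in enumerate(outer) if v == 1]
    return inner_idx, outer_idx


# 20 frames with a pseudo-stutter region spanning frames 6..15.
probs = [0.1] * 6 + [0.9] * 10 + [0.1] * 4
inner_idx, outer_idx = mine_confusing_frames(probs, m=2, M=4)
print(inner_idx, outer_idx)  # [8, 9, 12, 13] [2, 3, 18, 19]
```

With these settings the mined frames form two bands on either side of each pseudo-boundary. The paper's actual sampling rule may differ from this reading.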