Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis

Zaara Zabeen Arpa, Sadnam Sakib Apurbo, Nazia Karim Khan Oishee, Ajwad Abrar

TL;DR

This work tackles the ambiguity in Bangla ASR transcripts where word-word repetitions can signal either Repetition Disfluency or Morphological Reduplication. It introduces the first public Bangla Repetition Corpus with 20,000 labeled examples and a fine-grained nine-class reduplication scheme, evaluated via both in-context LLM prompting and task-specific encoder fine-tuning. The results show that fine-tuning BanglaBERT achieves the highest accuracy (84.78%) and F1 (0.677), outperforming prompting approaches, and provide a robust baseline for semantic-preserving Bangla text normalization. The dataset and findings advance practical NLP for Bangla by enabling models to preserve linguistic meaning during normalization of noisy ASR outputs.

Abstract

Automatic Speech Recognition (ASR) transcripts, especially in low-resource languages like Bangla, contain a critical ambiguity: word-word repetitions can be either Repetition Disfluency (unintentional ASR error/hesitation) or Morphological Reduplication (a deliberate grammatical construct). Standard disfluency correction fails by erroneously deleting valid linguistic information. To solve this, we introduce the first publicly available, 20,000-row Bangla corpus, manually annotated to explicitly distinguish between these two phenomena in noisy ASR transcripts. We benchmark this novel resource using two paradigms: state-of-the-art multilingual Large Language Models (LLMs) and task-specific fine-tuning of encoder models. LLMs achieve competitive performance (up to 82.68% accuracy) with few-shot prompting. However, fine-tuning proves superior, with the language-specific BanglaBERT model achieving the highest accuracy of 84.78% and an F1 score of 0.677. This establishes a strong, linguistically informed baseline and provides essential data for developing sophisticated, semantic-preserving text normalization systems for Bangla.
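
The gap between the reported accuracy (84.78%) and F1 (0.677) can look surprising. One plausible reading, which is an assumption here because the abstract does not state the averaging method, is that F1 is macro-averaged over classes, so a rare class that is often misclassified pulls the score down even when overall accuracy stays high. The sketch below illustrates that effect on invented toy labels; the label names follow Figure 1, and none of the numbers come from the paper.

```python
# Toy illustration only: the labels below are invented, not corpus data.
# Assumption: F1 is macro-averaged (the source does not say which average it uses).

def accuracy(gold, pred):
    """Fraction of positions where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    per_class = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        per_class.append(2 * precision * recall / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

# Imbalanced toy set: the single "Repetition" example is misclassified.
gold = ["Reduplication"] * 8 + ["Repetition", "Neither"]
pred = ["Reduplication"] * 8 + ["Reduplication", "Neither"]

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

Here nine of ten predictions are correct, yet the class with zero true positives contributes an F1 of 0 to the average, so macro F1 falls well below accuracy.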

Paper Structure

This paper contains 24 sections, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Illustration of the Bangla Repetition Classification task, highlighting the distinction between unintentional disfluencies (Repetition), grammatical forms (Reduplication), and coincidental occurrences (Neither).
  • Figure 2: The end-to-end pipeline for creating the Bangla Repetition Corpus. The workflow begins with scalable data acquisition from YouTube ASR transcripts, followed by automated filtering and context extraction. The core annotation phase employs a hybrid approach, using an LLM for initial labeling and expert linguists for final verification, resulting in a 20,000-row gold-standard corpus.
  • Figure 3: Accuracy Comparison Before and After Fine-Tuning
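
The "automated filtering and context extraction" step named in the Figure 2 caption could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `find_repetition_candidates`, the whitespace tokenization, and the `window` size are all hypothetical choices made for the example.

```python
# Hypothetical sketch: find adjacent word-word repetitions and keep nearby context.
# Assumptions: whitespace tokenization and exact string match between neighbours.

def find_repetition_candidates(transcript, window=3):
    """Return one record per adjacent identical token pair, with context tokens."""
    tokens = transcript.split()
    candidates = []
    for i in range(len(tokens) - 1):
        if tokens[i] == tokens[i + 1]:
            candidates.append({
                "word": tokens[i],
                "index": i,  # position of the first token of the pair
                "left": tokens[max(0, i - window):i],
                "right": tokens[i + 2:i + 2 + window],
            })
    return candidates

# "ধীরে ধীরে" ("slowly") is a reduplicated adverb, so a filter like this only
# produces candidates; deciding the label still needs a classifier or annotator.
for candidate in find_repetition_candidates("সে ধীরে ধীরে হাঁটছে"):
    print(candidate)
```

A run of three identical tokens yields two overlapping candidates under this scheme; whether such runs should be merged into one record is a design choice the source does not specify.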