Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis
Zaara Zabeen Arpa, Sadnam Sakib Apurbo, Nazia Karim Khan Oishee, Ajwad Abrar
TL;DR
This work tackles an ambiguity in Bangla ASR transcripts, where word-word repetitions can signal either Repetition Disfluency or Morphological Reduplication. It introduces the first public Bangla Repetition Corpus, with 20,000 labeled examples and a fine-grained nine-class reduplication scheme, and benchmarks it with both in-context LLM prompting and task-specific encoder fine-tuning. Fine-tuning BanglaBERT achieves the highest accuracy (84.78%) and F1 (0.677), outperforming the prompting approaches and providing a robust baseline for semantic-preserving Bangla text normalization. The dataset and findings advance practical NLP for Bangla by enabling models to preserve linguistic meaning when normalizing noisy ASR outputs.
Abstract
Automatic Speech Recognition (ASR) transcripts, especially in low-resource languages like Bangla, contain a critical ambiguity: word-word repetitions can be either Repetition Disfluency (unintentional ASR error/hesitation) or Morphological Reduplication (a deliberate grammatical construct). Standard disfluency correction fails by erroneously deleting valid linguistic information. To solve this, we introduce the first publicly available, 20,000-row Bangla corpus, manually annotated to explicitly distinguish between these two phenomena in noisy ASR transcripts. We benchmark this novel resource using two paradigms: state-of-the-art multilingual Large Language Models (LLMs) and task-specific fine-tuning of encoder models. LLMs achieve competitive performance (up to 82.68% accuracy) with few-shot prompting. However, fine-tuning proves superior, with the language-specific BanglaBERT model achieving the highest accuracy of 84.78% and an F1 score of 0.677. This establishes a strong, linguistically informed baseline and provides essential data for developing sophisticated, semantic-preserving text normalization systems for Bangla.
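To make the ambiguity concrete, the following is a minimal sketch of how candidate word-word repetitions might be located in a transcript before any classification takes place. It is not the authors' pipeline: the function name find_adjacent_repetitions and the two example sentences are illustrative assumptions, and tokenization is plain whitespace splitting, which a real system would likely replace with Unicode normalization and punctuation handling.

```python
from typing import List, Tuple


def find_adjacent_repetitions(transcript: str) -> List[Tuple[int, str]]:
    """Return (token_index, token) for every token that is immediately
    followed by an identical token.

    Hypothetical helper for illustration only; tokens are obtained by
    whitespace splitting, so attached punctuation will block a match.
    """
    tokens = transcript.split()
    hits = []
    for i in range(len(tokens) - 1):
        if tokens[i] == tokens[i + 1]:
            hits.append((i, tokens[i]))
    return hits


# Illustrative sentences, not drawn from the corpus.
# "ধীরে ধীরে" ("slowly") is a deliberate reduplication and must be kept.
reduplication = "ধীরে ধীরে চলো"
# "আমি আমি" ("I I") is a hesitation-style repetition that may be removed.
disfluency = "আমি আমি যাব"

for sentence in (reduplication, disfluency):
    print(sentence, find_adjacent_repetitions(sentence))
```

Both sentences yield exactly one surface-identical candidate at index 0, which is why a rule that simply deletes repeated tokens cannot work: telling the two cases apart requires a classifier that looks at the word and its context, which is the task the corpus is annotated for.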
