Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization

MD. Sagor Chowdhury, Adiba Fairooz Chowdhury

TL;DR

These experiments demonstrate that domain-specific fine-tuning of the segmentation component, vocal source separation, and natural silence-aware chunking are the three most impactful design choices for low-resource Bengali speech processing.

Abstract

We describe our end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. For ASR we achieve a best private Word Error Rate (WER) of 0.37738 and public WER of 0.36137, combining a BengaliAI fine-tuned Whisper medium model with Demucs source separation for vocal isolation, silence-boundary chunking, and carefully tuned generation hyperparameters. For speaker diarization we reach a best private Diarization Error Rate (DER) of 0.27671 and public DER of 0.20936 by replacing the default segmentation model inside the pyannote.audio pipeline with a Bengali-fine-tuned variant, pairing it with wespeaker-voxceleb-resnet34-LM embeddings and centroid-based agglomerative clustering. Our experiments demonstrate that domain-specific fine-tuning of the segmentation component, vocal source separation, and natural silence-aware chunking are the three most impactful design choices for low-resource Bengali speech processing.
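For readers unfamiliar with the ASR metric quoted above, WER is conventionally the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The snippet below is a minimal sketch of that standard definition only. The function name `word_error_rate` and the plain whitespace tokenisation are assumptions made for illustration; the competition's official scorer may normalise text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()  # assumption: whitespace tokenisation
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted against the reference length, the value can exceed 1.0 when the hypothesis is much longer than the reference.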

Paper Structure (31 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Overview of the DL Sprint 4.0 tasks. The ASR track maps long-form Bengali speech to transcribed text, while the diarization track outputs time-stamped speaker segments in structured format.
  • Figure 2: Overview of the ASR methodology, showing model selection, hyperparameter tuning, and fine-tuning strategies used for final test-set generation.
  • Figure 3: Best ASR pipeline: vocal separation and peak normalisation feed into silence-based chunking, followed by Whisper-Medium beam-search decoding and lightweight post-processing.
  • Figure 4: Best diarization pipeline: fine-tuned Bengali segmentation model and WeSpeaker ResNet34-LM embedding extractor assembled inside the pyannote/speaker-diarization-3.1 backbone, with centroid clustering ($\tau=0.65$, min_cluster_size=20).
  • Figure 5: Model comparison on a 10 s probe (train_001). The highlighted row (bengaliai-asr_whisper-medium) produced the most accurate transcription.
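The Figure 3 caption names two preprocessing steps, peak normalisation and silence-based chunking, that can be illustrated in a few lines. The following is a rough sketch under stated assumptions, not the authors' implementation: the function names, the frame-peak silence detector, and every parameter (`target_peak`, `frame_len`, `max_frames`, `silence_thresh`) are hypothetical choices made for illustration.

```python
from typing import List, Tuple


def peak_normalise(samples: List[float], target_peak: float = 0.95) -> List[float]:
    """Scale the waveform so its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # all-zero or empty input: nothing to scale
    scale = target_peak / peak
    return [s * scale for s in samples]


def silence_chunks(
    samples: List[float],
    frame_len: int,
    max_frames: int,
    silence_thresh: float,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of at most max_frames frames each,
    cutting at the latest silent frame inside the window when one exists."""
    n_frames = (len(samples) + frame_len - 1) // frame_len
    # Assumption: a frame counts as silent if its peak amplitude is below the threshold.
    silent = [
        max(abs(s) for s in samples[f * frame_len:(f + 1) * frame_len]) < silence_thresh
        for f in range(n_frames)
    ]

    chunks: List[Tuple[int, int]] = []
    start = 0
    while n_frames - start > max_frames:
        limit = start + max_frames
        cut = limit  # fall back to a hard cut if no silent frame is found
        for f in range(limit - 1, start, -1):
            if silent[f]:
                cut = f  # the chunk ends where this silent frame begins
                break
        chunks.append((start * frame_len, cut * frame_len))
        start = cut
    if start < n_frames:
        chunks.append((start * frame_len, len(samples)))
    return chunks
```

Searching backwards from the window limit keeps each chunk as long as possible while still preferring a boundary that falls in a pause rather than mid-word.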