WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech

Aurchi Chowdhury; Rubaiyat -E-Zaman; Sk. Ashrafuzzaman Nafees

TL;DR

This paper presents a solution for DL Sprint 4.0 that covers both Bengali long-form speech recognition and speaker diarization: an audio chunking strategy based on whisper-timestamped for ASR, and an integrated pipeline built on pyannote.audio and WhisperX for diarization.

Abstract

This paper presents our solution for DL Sprint 4.0, which addresses two tasks: Bengali Long-Form Speech Recognition (Task 1) and Speaker Diarization (Task 2). Processing long-form, multi-speaker Bengali audio poses significant challenges in voice activity detection, overlapping speech, and context preservation. For long-form transcription, we implemented an audio chunking strategy based on whisper-timestamped, which lets us feed precise, context-aware segments into our fine-tuned acoustic model. For diarization, we developed an integrated pipeline built on pyannote.audio and WhisperX. A key contribution of our approach is the domain-specific fine-tuning of the Pyannote segmentation model on the competition dataset. This adaptation allowed the model to better capture Bengali conversational dynamics and to resolve complex, overlapping speaker boundaries more accurately. Our methodology demonstrates that applying timestamp-based chunking to ASR and targeted segmentation fine-tuning to diarization significantly reduces Word Error Rate (WER) and Diarization Error Rate (DER) in low-resource settings.
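The idea of cutting long audio only at word boundaries can be illustrated with a minimal sketch. It assumes word-level timestamps are already available (for example from a tool such as whisper-timestamped) and groups them greedily under a duration limit. The `Word` type, the function name, and the 30-second default are hypothetical choices for illustration and are not taken from the paper, whose actual chunking procedure may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One recognized word with its time span in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


def chunk_at_word_boundaries(words: List[Word], max_len: float = 30.0) -> List[List[Word]]:
    """Greedily group words into chunks of at most max_len seconds.

    A cut is only ever placed between two words, so no word is split.
    A single word longer than max_len becomes a chunk of its own.
    """
    chunks: List[List[Word]] = []
    current: List[Word] = []
    for w in words:
        # Close the current chunk if adding this word would exceed the limit.
        if current and w.end - current[0].start > max_len:
            chunks.append(current)
            current = []
        current.append(w)
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk spans from the start of its first word to the end of its last word, so the corresponding audio slice and transcript text stay aligned.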

Paper Structure (25 sections, 2 figures, 6 tables)

This paper contains 25 sections, 2 figures, 6 tables.

Introduction
Novelty and Contribution
Semantic, Word-Boundary-Aware Chunking
Bangla-Adapted Diarization with Exclusive Overlap Handling and VAD Intersection
Method and Architecture
System Overview
Data Preparation
Fine-Tuning
Inference
Experiments and Evaluation
Evaluation Metric
Results
Findings and Analysis
Benchmarks and Comparisons
Method and Architecture
...and 10 more sections

Figures (2)

Figure 1: End-to-end training data pipeline: from raw long-form audio to aligned, boundary-respecting chunks for fine-tuning.
Figure 2: Proposed parallel diarization architecture.
