Bengali-Loop: Community Benchmarks for Long-Form Bangla ASR and Speaker Diarization

H. M. Shadman Tabib, Istiak Ahmmed Rifti, Abdullah Muhammed Amimul Ehsan, Somik Dasgupta, Md Zim Mim Siddiqee Sowdha, Abrar Jahin Sarker, Md. Rafiul Islam Nijamy, Tanvir Hossain, Mst. Metaly Khatun, Munzer Mahmood, Rakesh Debnath, Gourab Biswas, Asif Karim, Wahid Al Azad Navid, Masnoon Muztahid, Fuad Ahmed Udoy, Shahad Shahriar Rahman, Md. Tashdiqur Rahman Shifat, Most. Sonia Khatun, Mushfiqur Rahman, Md. Miraj Hasan, Anik Saha, Mohammad Ninad Mahmud Nobo, Soumik Bhattacharjee, Tusher Bhomik, Ahmmad Nur Swapnil, Shahriar Kabir

TL;DR

The paper tackles the scarcity of long-form Bengali ASR and speaker diarization resources by introducing Bengali-Loop, which comprises two benchmarks: a long-form ASR corpus (191 recordings, 158.6 hours, 792k words) with human-verified transcripts, and a fully manual diarization corpus (24 recordings, 22 hours, 5,744 segments) with per-segment speaker labels. It provides standardized evaluation protocols for WER/CER and DER, along with baseline results (e.g., Tugstugi achieving 34.07% WER and pyannote.audio achieving 40.08% DER) to establish performance anchors. The work emphasizes reproducible benchmarking, including data formats, annotation rules, and evaluation scripts, to foster future model development for Bangla long-form ASR and diarization. Overall, Bengali-Loop offers publicly released data, clear evaluation standards, and practical baselines to accelerate progress in Bengali long-form speech technology.

Abstract

Bengali (Bangla) remains under-resourced in long-form speech technology despite its wide use. We present Bengali-Loop, two community benchmarks to address this gap: (1) a long-form ASR corpus of 191 recordings (158.6 hours, 792k words) from 11 YouTube channels, collected via a reproducible subtitle-extraction pipeline and human-in-the-loop transcript verification; and (2) a speaker diarization corpus of 24 recordings (22 hours, 5,744 annotated segments) with fully manual speaker-turn labels in CSV format. Both benchmarks target realistic multi-speaker, long-duration content (e.g., Bangla drama/natok). We establish baselines (Tugstugi: 34.07% WER; pyannote.audio: 40.08% DER) and provide standardized evaluation protocols (WER/CER, DER), annotation rules, and data formats to support reproducible benchmarking and future model development for Bangla long-form ASR and diarization.
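
As an illustration of the WER/CER metrics named above, the following is a minimal sketch of how word and character error rates are commonly computed from a Levenshtein edit distance. It is not the benchmark's own evaluation script. The function names are hypothetical, and the text handling (whitespace tokenization for WER, whitespace removed and Unicode code points as units for CER, no further normalization) is an assumption that may differ from the paper's protocol.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (minimum number of substitutions, insertions, and deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words.
    Assumption: words are separated by whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length.
    Assumption: whitespace is dropped and each Unicode code point is one unit,
    so Bangla vowel signs and conjunct parts are counted separately."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    reference = "আমি ভাত খাই"
    hypothesis = "আমি ভাত খায়"
    print(f"WER: {wer(reference, hypothesis):.4f}")
    print(f"CER: {cer(reference, hypothesis):.4f}")
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, which matters for long-form recordings where a system may hallucinate text during music or silence.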

Paper Structure

This paper contains 26 sections, 1 equation, 1 figure, and 6 tables.

Figures (1)

  • Figure 1: Bengali-Loop raw dataset statistics. (a) Distribution of recording durations (mean and median marked). (b) Distribution of transcript word counts per recording. (c) Source channel breakdown (abbreviations: Eagle Prem. = Eagle Premier Station; Banglavision = Banglavision DRAMA; Maasranga = Maasranga Drama; KS Ent. = KS Entertainment; Gollachut = GOLLACHUT; Raad = Raad Drama; Rabbit Ent. = Rabbit Entertainment; Club 11 = Club 11 Entertainment; Folk Studio = Folk Studio Bangla). (d) Duration vs. word count with linear fit. (e) Cumulative duration across recordings. (f) Subtitle source language distribution.