BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization

Md. Nazmus Sadat Samin, Jawad Ibn Ahad, Tanjila Ahmed Medha, Fuad Rahman, Mohammad Ruhul Amin, Nabeel Mohammed, Shafin Rahman

TL;DR

This study presents an end-to-end pipeline for converting dialectal Noakhali speech to standard Bangla speech, using AlignTTS, a text-to-speech (TTS) model, as the final stage that completes dialect standardization.

Abstract

This study focuses on recognizing Bangladeshi dialects and converting diverse Bengali accents into standardized formal Bengali speech. Dialects, often referred to as regional languages, are distinctive variations of a language spoken in a particular location and are identified by their phonetics, pronunciations, and lexicon. Subtle changes in pronunciation and intonation are also influenced by geographic location, educational attainment, and socioeconomic status. Dialect standardization is needed to ensure effective communication, educational consistency, access to technology, economic opportunities, and the preservation of linguistic resources while respecting cultural diversity. Bangla is the fifth most spoken language, with around 55 distinct dialects spoken by 160 million people, so addressing Bangla dialects is crucial for developing inclusive communication tools. However, research remains limited because comprehensive datasets are lacking and diverse dialects are difficult to handle. Advances in multilingual Large Language Models (mLLMs) have opened new possibilities for addressing the challenges of dialectal Automated Speech Recognition (ASR) and Machine Translation (MT). This study presents an end-to-end pipeline for converting dialectal Noakhali speech to standard Bangla speech. The investigation includes constructing a large-scale, diverse dataset of dialectal speech signals, which is used to fine-tune an ASR model that transcribes dialect speech into dialect text and an LLM that translates the dialect text into standard Bangla text. Our experiments demonstrated that the fine-tuned Whisper ASR model achieved a CER of 0.8% and a WER of 1.5%, while the BanglaT5 model attained a BLEU score of 41.6% for dialect-to-standard text translation.

Paper Structure

This paper contains 12 sections, 3 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: (a) A typical Deep Neural Network (DNN) based ASR implementation that combines word-based annotation with HMMs, as used by al2019continuous, khan2018assessing, and samin2021deep. (b) Recent advances in LLMs have led researchers to investigate how well LLMs can handle speech signals. The mLLM-based approach has been applied in gudepu2020whisper, pratama2024analysis, and phung2024enhancing using feature extraction and alphabet-wise mapping. Existing methods often fall short when processing large volumes of speech data, particularly dialect speech, because of limited data availability and resources. End-to-end frameworks, meanwhile, remain less explored in the literature. (c) We introduce a novel approach that fine-tunes ASR and mLLMs on a large-scale low-resource Bangla dialect speech dataset. The approach has two parts: a multilingual ASR model first transcribes the dialect speech signal into dialect text, and an LLM then translates and standardizes the ASR model's predicted dialect text into standard Bangla text. Our approach includes reliable preprocessing techniques to handle large-scale speech signals.
  • Figure 2: BanglaDialecto system: (a) Input dialect speech signals $s^i$ are converted to WAV format, then undergo noise reduction and are split into manageable 5-second speech segments $s^i_k$. The dialect text $t^i_d$ and standard text $t^i_s$ are then segmented into the corresponding chunks $t^i_{d,k}$ and $t^i_{s,k}$. (b) The segment $s^i_k$ and $t^i_d$ are used to fine-tune the ASR model to transcribe dialect speech $s^i_k$ into dialect text $t^i_d$. The other segment $t^i_{s,k}$, together with $t^i_{d,k}$, is used to fine-tune the LLMs for MT from dialect text to standard Bangla text. (c) In the end-to-end framework, the transcript $\hat{t}^i_{d,k}$ that the model $\mathcal{F}_1$ predicts for the dialect speech signal $s^i_k$ is passed to the model $\mathcal{F}_2$, which translates it into standard Bangla text $\hat{t}^i_{s,k}$. We integrate a TTS model to generate the standard Bangla speech signal from the translated standard Bangla text $\hat{t}^i_{s,k}$.
  • Figure 3: Interview participant distribution across the Noakhali region. Conducting interviews with respondents of different regions added diversity of speech accents to our NDD dataset.
  • Figure 4: (a) The impact of increasing Whisper model parameters on accuracy, as measured by CER and WER. It shows that larger models, particularly Whisper-large V2, yield better performance, even in their pre-trained state, due to their higher parameter count. (b) Comparison of mBART, IndicBART, mT5 (Base), and BanglaT5 across CER, WER, and BLEU metrics. mT5 (Base) has the highest CER, mBART performs best in WER, and BLEU scores are similar for all models.
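
The data flow in the Figure 2 caption can be illustrated with a minimal sketch: a waveform is cut into 5-second segments, and each segment is passed through an ASR stage ($\mathcal{F}_1$), a translation stage ($\mathcal{F}_2$), and a TTS stage. This is not the authors' implementation. The function names `split_into_segments` and `standardize` are hypothetical, the three stages are supplied as plain callables rather than real Whisper, BanglaT5, or AlignTTS models, noise reduction is omitted, and leaving the final segment shorter than 5 seconds (instead of padding it) is an assumption.

```python
from typing import Callable, List, Sequence

SEGMENT_SECONDS = 5.0  # segment length stated in the Figure 2 caption


def split_into_segments(
    samples: Sequence[float],
    sample_rate: int,
    segment_seconds: float = SEGMENT_SECONDS,
) -> List[Sequence[float]]:
    """Split a waveform s^i into consecutive segments s^i_k.

    Assumption: the last segment is kept as-is even if it is shorter
    than segment_seconds; the source does not say how it is handled.
    """
    if sample_rate <= 0 or segment_seconds <= 0:
        raise ValueError("sample_rate and segment_seconds must be positive")
    step = int(sample_rate * segment_seconds)  # samples per segment
    return [samples[start:start + step] for start in range(0, len(samples), step)]


def standardize(
    samples: Sequence[float],
    sample_rate: int,
    asr: Callable[[Sequence[float], int], str],  # stands in for F_1
    translate: Callable[[str], str],             # stands in for F_2
    tts: Callable[[str], Sequence[float]],       # stands in for the TTS model
) -> List[Sequence[float]]:
    """Chain the three stages per segment: speech -> dialect text -> standard text -> speech."""
    outputs = []
    for segment in split_into_segments(samples, sample_rate):
        dialect_text = asr(segment, sample_rate)   # predicted dialect transcript
        standard_text = translate(dialect_text)    # predicted standard Bangla text
        outputs.append(tts(standard_text))         # synthesized standard speech
    return outputs


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model weights.
    fake_audio = [0.0] * 12          # 12 samples at 2 Hz = 6 seconds
    result = standardize(
        fake_audio,
        sample_rate=2,
        asr=lambda seg, sr: f"dialect<{len(seg)}>",
        translate=lambda text: text.replace("dialect", "standard"),
        tts=lambda text: [float(len(text))],
    )
    print(len(result))  # one synthesized output per segment
```

Passing the stages in as callables keeps the sketch independent of any particular model library; real models would be wrapped in functions with the same shapes.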