BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects

Jakir Hasan, Shubhashis Roy Dipta

TL;DR

BanglaTalk tackles real-time Bengali speech assistance across diverse regional dialects by introducing a dialect-aware ASR system (BRDialect) and a low-bandwidth, RTP-based client-server pipeline. The system combines lightweight client-side processing, the BRDialect ASR trained on ten Bengali dialects, streaming LLM-driven responses, and TTS, achieving an average end-to-end latency of $4.9$ seconds while transmitting at approximately $24$ kbps. On RegSpeech12, BRDialect reports $WER=0.741$ and $CER=0.406$, and its output is further improved by processing steps such as Unicode normalization and punctuation removal; the LLM-based dialogue handling compensates for residual transcription errors. A preliminary user study was also positive, indicating practical potential for inclusive Bengali dialect technology, especially in bandwidth-constrained settings. Overall, the work advances real-time, dialect-aware speech assistance for a major low-resource language and provides a concrete architecture for future expansion and evaluation across more dialects and applications.
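
The WER and CER figures above are edit-distance ratios computed between a reference transcript and the ASR output. The sketch below shows one common way to compute them after Unicode normalization and punctuation removal. It is an illustration only, not the authors' evaluation code: the helper names (`normalize`, `edit_distance`, `wer`, `cer`) are hypothetical, and Python's standard NFC normalization stands in for whichever Bangla-specific normalizer the paper uses.

```python
import string
import unicodedata

# Assumption: punctuation to drop is ASCII punctuation plus the Bengali
# danda (U+0964) and double danda (U+0965). The paper's exact set may differ.
PUNCTUATION = set(string.punctuation) | {"\u0964", "\u0965"}


def normalize(text: str) -> str:
    """Apply NFC normalization, then remove punctuation characters."""
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if ch not in PUNCTUATION)


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = normalize(ref).split()
    hyp_words = normalize(hyp).split()
    if not ref_words:
        raise ValueError("reference has no words after normalization")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    ref_chars, hyp_chars = normalize(ref), normalize(hyp)
    if not ref_chars:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

Because punctuation is stripped before scoring, a hypothesis that differs from the reference only by a trailing danda or comma scores zero errors under this sketch.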

Abstract

Real-time speech assistants are becoming increasingly popular for improving accessibility to information. Bengali, a low-resource language with high regional dialectal diversity, has seen limited progress in developing such systems. Existing systems are not optimized for real-time use and focus only on standard Bengali. In this work, we present BanglaTalk, the first real-time speech assistance system for Bengali regional dialects. BanglaTalk follows a client-server architecture and uses the Real-time Transport Protocol (RTP) to ensure low-latency communication. To address dialectal variation, we introduce a dialect-aware ASR system, BRDialect, developed by fine-tuning the IndicWav2Vec model on ten Bengali regional dialects. It outperforms the baseline ASR models by 12.41-33.98% on the RegSpeech12 dataset. Furthermore, BanglaTalk can operate at a low bandwidth of 24 kbps while maintaining an average end-to-end delay of 4.9 seconds. Low bandwidth usage and minimal end-to-end delay make the system both cost-effective and interactive for real-time use cases, enabling inclusive and accessible speech technology for the diverse community of Bengali speakers. Code is available at https://github.com/Jak57/BanglaTalk
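
To give a feel for what a 24 kbps uplink means in practice, the following sketch converts a bitrate into the amount of data uploaded over a given duration. It is plain arithmetic rather than anything taken from the BanglaTalk code; the function name `upload_bytes` is hypothetical, and it assumes 1 kbps = 1000 bits per second and ignores any packet-header overhead.

```python
def upload_bytes(bitrate_kbps: float, seconds: float) -> float:
    """Bytes sent at a constant bitrate, assuming 1 kbps = 1000 bit/s."""
    bits = bitrate_kbps * 1000 * seconds
    return bits / 8  # 8 bits per byte


if __name__ == "__main__":
    per_minute = upload_bytes(24, 60)
    print(f"{per_minute / 1000:.0f} kB per minute of speech")
```

Under these assumptions, one minute of streaming at 24 kbps uploads about 180 kB.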

Paper Structure

This paper contains 52 sections, 2 equations, 10 figures, 9 tables, 3 algorithms.

Figures (10)

  • Figure 1: Existing Bengali speech assistants (left) fail to understand queries in regional dialects due to reliance on standard Bengali ASR (incorrect transcriptions are shown in red). BanglaTalk (right) successfully handles regional dialect queries through its dialect-aware ASR (BRDialect). It is bandwidth-efficient and operates in real time because it uses the Real-time Transport Protocol.
  • Figure 2: Client (left) and server-side (right) processing pipelines of the BanglaTalk System.
  • Figure 3: Region-wise word error rate distribution on the test set of the RegSpeech12 dataset. Transcriptions are generated using the BRDialect ASR system.
  • Figure 4: Distribution of Levenshtein distance for three ASR systems on the RegSpeech12 dataset under the best processing settings: no noise cancellation, with Bangla Unicode normalization and punctuation removal.
  • Figure 5: Uploading bitrate on the client side for a duration of one minute.
  • ...and 5 more figures