BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects
Jakir Hasan, Shubhashis Roy Dipta
TL;DR
BanglaTalk addresses real-time Bengali speech assistance across regional dialects by introducing a dialect-aware ASR model (BRDialect) and a low-bandwidth, RTP-based client-server pipeline. The system combines lightweight client-side processing, the BRDialect ASR model trained on ten Bengali dialects, streaming LLM-generated responses, and TTS. It achieves an average end-to-end latency of 4.9 seconds while transmitting at approximately 24 kbps. On RegSpeech12, BRDialect has a reported WER of 0.741 and CER of 0.406, and its output is further improved by processing steps such as Unicode normalization and punctuation removal; the LLM-based dialogue handling compensates for residual transcription errors. A preliminary user study was positive, indicating practical potential for inclusive Bengali dialect technology, especially in bandwidth-constrained settings. Overall, the work advances real-time, dialect-aware speech assistance for a major low-resource language and provides a concrete architecture that can be extended to more dialects and applications.
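The text processing and error metrics mentioned above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the choice of NFC as the normalization form, the punctuation set, and the function names `normalize`, `wer`, and `cer` are all assumptions made for illustration. CER here counts Unicode code points, which may differ from a grapheme-based count for Bengali conjuncts.

```python
import string
import unicodedata

# Assumption: ASCII punctuation plus the danda and double danda are removed.
# The punctuation set actually used for BRDialect is not stated in the source.
PUNCTUATION = set(string.punctuation) | {"\u0964", "\u0965"}


def normalize(text: str) -> str:
    """Apply Unicode normalization (NFC, an assumption) and strip punctuation."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(" " if ch in PUNCTUATION else ch for ch in text)
    return " ".join(text.split())  # collapse runs of whitespace


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = normalize(reference).split()
    if not ref_words:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref_words, normalize(hypothesis).split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over code points of the normalized strings."""
    ref_chars = normalize(reference)
    if not ref_chars:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref_chars, normalize(hypothesis)) / len(ref_chars)
```

With this normalization, a hypothesis that differs from its reference only by a trailing danda scores a WER of zero, which is why punctuation removal lowers the measured error rates.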
Abstract
Real-time speech assistants are becoming increasingly popular for improving access to information. Bengali, a low-resource language with high regional dialectal diversity, has seen limited progress in developing such systems. Existing systems are not optimized for real-time use and focus only on standard Bengali. In this work, we present BanglaTalk, the first real-time speech assistance system for Bengali regional dialects. BanglaTalk follows a client-server architecture and uses the Real-time Transport Protocol (RTP) to ensure low-latency communication. To address dialectal variation, we introduce a dialect-aware ASR system, BRDialect, developed by fine-tuning the IndicWav2Vec model on ten Bengali regional dialects. It outperforms the baseline ASR models by 12.41-33.98% on the RegSpeech12 dataset. Furthermore, BanglaTalk can operate at a low bandwidth of 24 kbps while maintaining an average end-to-end delay of 4.9 seconds. Low bandwidth usage and minimal end-to-end delay make the system both cost-effective and interactive for real-time use cases, enabling inclusive and accessible speech technology for the diverse community of Bengali speakers. Code is available at https://github.com/Jak57/BanglaTalk
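To give a feel for what a 24 kbps audio stream means at the packet level, the following back-of-the-envelope sketch converts a bitrate into per-packet payload size and estimates the added cost of RTP headers. The 20 ms packet interval is an assumption (the abstract does not state the framing), as is treating 24 kbps as the payload rate before header overhead; the 12-byte figure is the fixed RTP header size without CSRC entries or extensions. The function names are hypothetical.

```python
# Fixed RTP header size in bytes (no CSRC list, no header extension).
RTP_HEADER_BYTES = 12


def payload_bytes_per_packet(bitrate_kbps: float, frame_ms: float) -> float:
    """Audio payload carried by one packet at the given bitrate and interval."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return bytes_per_second * frame_ms / 1000


def rate_with_rtp_headers_kbps(bitrate_kbps: float, frame_ms: float,
                               header_bytes: int = RTP_HEADER_BYTES) -> float:
    """Payload bitrate plus RTP header overhead (UDP/IP headers excluded)."""
    packets_per_second = 1000 / frame_ms
    overhead_kbps = packets_per_second * header_bytes * 8 / 1000
    return bitrate_kbps + overhead_kbps


if __name__ == "__main__":
    # Assumption: 20 ms packets; the source only reports the 24 kbps figure.
    print(payload_bytes_per_packet(24, 20))    # payload bytes per packet
    print(rate_with_rtp_headers_kbps(24, 20))  # kbps including RTP headers
```

Under these assumptions each packet carries 60 bytes of audio, and shorter packet intervals raise the relative header overhead because the fixed header is sent more often.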
