BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla
Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, Kazi Samin, Md Saiful Islam, Anindya Iqbal, M. Sohel Rahman, Rifat Shahriyar
TL;DR
BanglaBERT is a Bangla language model pretrained with ELECTRA's replaced token detection (RTD) objective on a large Bangla corpus; a companion model, BanglishBERT, supports bilingual transfer. The authors also establish BLUB, the first comprehensive Bangla NLU benchmark, covering sentiment classification, natural language inference (NLI), named entity recognition (NER), and question answering (QA). BanglaBERT achieves state-of-the-art results on BLUB, and BanglishBERT shows strong zero-shot performance. The work highlights sample- and compute-efficiency advantages in data-scarce settings, and the models, data, and a leaderboard are publicly released to support further Bangla NLP research.
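For readers unfamiliar with RTD, the following minimal sketch shows how the discriminator's training labels can be derived: some input tokens are swapped for other tokens, and the model predicts, per position, whether the token is original or replaced. This is an illustration only, not the authors' pipeline. The function names `rtd_labels` and `corrupt`, the `replace_prob` value, and the random-sampling corruption (standing in for ELECTRA's small generator network) are all assumptions made for this example.

```python
import random


def rtd_labels(original, corrupted):
    """Return per-token RTD targets: 1 if the token was replaced, 0 if it is original.

    A position whose sampled replacement happens to equal the original
    token counts as original, as in ELECTRA.
    """
    if len(original) != len(corrupted):
        raise ValueError("sequences must have equal length")
    return [int(o != c) for o, c in zip(original, corrupted)]


def corrupt(tokens, vocab, replace_prob=0.15, seed=0):
    """Toy stand-in for the generator (hypothetical helper).

    Each position is replaced, with probability `replace_prob`, by a token
    drawn uniformly from `vocab`. ELECTRA instead samples replacements from
    a small masked language model; `replace_prob` here is illustrative.
    """
    rng = random.Random(seed)
    return [rng.choice(vocab) if rng.random() < replace_prob else t for t in tokens]


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]
print(rtd_labels(original, corrupted))  # [0, 0, 1, 0, 0]
```

Because every position yields a training signal, rather than only the masked subset, this objective is commonly cited as the reason for ELECTRA's sample efficiency.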
Abstract
In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed 'Bangla2B+') by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results, outperforming multilingual and monolingual models. We are making the models, datasets, and a leaderboard publicly available at https://github.com/csebuetnlp/banglabert to advance Bangla NLP.
