BSpell: A CNN-Blended BERT Based Bangla Spell Checker
Chowdhury Rafeed Rahman, MD. Hasibur Rahman, Samiha Zakir, Mohammad Rafsan, Mohammed Eunus Ali
TL;DR
Bangla spelling correction is challenged by complex compound characters and QWERTY-based input. The authors introduce BSpell, a CNN-Blended BERT model that uses a per-word SemanticNet CNN for intra-word patterns and a main BERT_Base branch for sentence-level context, aided by an auxiliary loss and a novel hybrid pretraining scheme that blends word and character masking. The approach achieves state-of-the-art performance on Bangla and Hindi datasets, with extensive ablations confirming the contributions of SemanticNet, auxiliary loss, and hybrid pretraining. While effective, BSpell operates on word-for-word corrections and may output UNK for rare terms or fail on word merges/splits, pointing to future work in subword-level modeling and grammar-aware corrections for broader applicability.
Abstract
Bangla is mostly typed on English keyboards, and the resulting text can be highly error-prone because of compound and similarly pronounced letters. Correcting a misspelled word requires understanding both how the word was typed and the context in which it is used. This paper proposes BSpell, a specialized BERT model for word-for-word correction at the sentence level. BSpell contains an end-to-end trainable CNN sub-model named SemanticNet along with a specialized auxiliary loss, which allows it to handle the highly inflected Bangla vocabulary in the presence of spelling errors. Furthermore, a hybrid pretraining scheme that combines word-level and character-level masking is proposed for BSpell. Comparison on two Bangla spelling correction datasets and one Hindi dataset shows the superiority of the proposed approach. BSpell is available as a Bangla spell checking tool via GitHub: https://github.com/Hasiburshanto/Bangla-Spell-Checker
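
The hybrid pretraining idea, blending word-level and character-level masking, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `hybrid_mask`, the mask symbols, and the rates `mask_rate`, `char_share`, and `char_rate` are all assumptions chosen for illustration, and the abstract does not state the actual proportions used. The sketch selects a fraction of words in a sentence and either replaces the whole word with a mask token or masks a few of its characters, keeping the original word as the prediction target.

```python
import random

WORD_MASK = "[MASK]"  # hypothetical word-level mask token
CHAR_MASK = "#"       # hypothetical character-level mask symbol


def hybrid_mask(words, mask_rate=0.15, char_share=0.5, char_rate=0.3, rng=None):
    """Return (masked_words, targets) for one sentence.

    All rates are illustrative assumptions, not values from the paper.
    targets[i] is the original word if position i was masked, else None.
    """
    rng = rng or random.Random()
    masked, targets = [], []
    for word in words:
        # Leave most words untouched.
        if rng.random() >= mask_rate:
            masked.append(word)
            targets.append(None)
            continue
        targets.append(word)
        if rng.random() < char_share:
            # Character-level masking: hide a few positions inside the word.
            chars = list(word)
            n = min(len(chars), max(1, round(len(chars) * char_rate)))
            for i in rng.sample(range(len(chars)), n):
                chars[i] = CHAR_MASK
            masked.append("".join(chars))
        else:
            # Word-level masking: hide the whole word.
            masked.append(WORD_MASK)
    return masked, targets


if __name__ == "__main__":
    sentence = ["ami", "banglay", "gan", "gai"]  # romanized for readability
    print(hybrid_mask(sentence, mask_rate=0.5, rng=random.Random(0)))
```

Note that the sketch masks individual Unicode code points. A Bangla compound character typically spans several code points, so a faithful version would likely need to operate on the character units the model actually consumes.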
