BSpell: A CNN-Blended BERT Based Bangla Spell Checker

Chowdhury Rafeed Rahman; MD. Hasibur Rahman; Samiha Zakir; Mohammad Rafsan; Mohammed Eunus Ali

TL;DR

Bangla spelling correction is challenged by complex compound characters and QWERTY-based input. The authors introduce BSpell, a CNN-Blended BERT model that uses a per-word SemanticNet CNN for intra-word patterns and a main BERT_Base branch for sentence-level context, aided by an auxiliary loss and a novel hybrid pretraining scheme that blends word and character masking. The approach achieves state-of-the-art performance on Bangla and Hindi datasets, with extensive ablations confirming the contributions of SemanticNet, auxiliary loss, and hybrid pretraining. While effective, BSpell operates on word-for-word corrections and may output UNK for rare terms or fail on word merges/splits, pointing to future work in subword-level modeling and grammar-aware corrections for broader applicability.
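The word-for-word constraint mentioned above can be illustrated with a minimal sketch. It assumes whitespace tokenization and uses English placeholder words; the function name `align_word_for_word` is hypothetical and does not come from the paper or its repository. It shows only why a merge or split breaks a one-to-one pairing of noisy and corrected words.

```python
def align_word_for_word(noisy: str, corrected: str):
    """Pair each noisy word with its correction.

    Returns None when the word counts differ, which is what happens
    when a typo merges two words or splits one word in two.
    Whitespace tokenization is an assumption of this sketch.
    """
    noisy_words = noisy.split()
    corrected_words = corrected.split()
    if len(noisy_words) != len(corrected_words):
        return None  # no one-to-one pairing exists
    return list(zip(noisy_words, corrected_words))
```

A pair such as "teh cat" and "the cat" aligns cleanly, while "thecat sat" and "the cat sat" does not, because the merge changes the word count.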

Abstract

Bangla typing is mostly performed using an English keyboard and can be highly error-prone due to the presence of compound and similarly pronounced letters. Correcting a misspelled word requires understanding both the typing pattern of the word and the context in which it is used. This paper proposes BSpell, a specialized BERT model targeted at word-for-word correction at the sentence level. BSpell contains an end-to-end trainable CNN sub-model named SemanticNet along with a specialized auxiliary loss. This allows BSpell to specialize in highly inflected Bangla vocabulary in the presence of spelling errors. Furthermore, a hybrid pretraining scheme is proposed for BSpell that combines word-level and character-level masking. Comparison on two Bangla spelling correction datasets and one Hindi spelling correction dataset shows the superiority of our proposed approach. BSpell is available as a Bangla spell checking tool via GitHub: https://github.com/Hasiburshanto/Bangla-Spell-Checker
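As a rough illustration of combining word-level and character-level masking, the sketch below masks some words entirely and masks a subset of characters in others. The mask symbols, the probabilities, the `char_fraction` parameter, and the function name `hybrid_mask` are all assumptions made for this example; the paper's actual pretraining scheme may differ in every one of these details.

```python
import random

MASK_WORD = "[MASK]"  # hypothetical word-level mask token
MASK_CHAR = "*"       # hypothetical character-level mask symbol


def hybrid_mask(words, word_mask_prob=0.15, char_mask_prob=0.15,
                char_fraction=0.5, seed=None):
    """Return (masked_words, targets) for one sentence.

    Each word is replaced by MASK_WORD with probability word_mask_prob,
    has a fraction of its characters replaced by MASK_CHAR with
    probability char_mask_prob, and is left unchanged otherwise.
    The targets are always the original words.
    """
    rng = random.Random(seed)
    masked = []
    for word in words:
        r = rng.random()
        if r < word_mask_prob:
            # Word-level masking: hide the whole word.
            masked.append(MASK_WORD)
        elif r < word_mask_prob + char_mask_prob:
            # Character-level masking: hide some characters, keep the length.
            chars = list(word)
            n = min(len(chars), max(1, int(len(chars) * char_fraction)))
            for i in rng.sample(range(len(chars)), n):
                chars[i] = MASK_CHAR
            masked.append("".join(chars))
        else:
            masked.append(word)
    return masked, list(words)
```

Calling `hybrid_mask("some example sentence".split(), seed=0)` returns a masked sentence alongside the unchanged targets. A model pretrained this way would be asked to recover each target word from its fully or partially masked form.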

Paper Structure (24 sections, 6 figures, 6 tables)

This paper contains 24 sections, 6 figures, 6 tables.

Introduction
Related Works
Our Approach
Problem Statement
BSpell Architecture
SemanticNet Sub-Model
BERT_Base as Main Branch
Auxiliary Loss in Secondary Branch
BERT Hybrid Pretraining
Experimental Setup
Implemented Pretraining Schemes
Dataset Specification
BSpell Architecture Hyperparameters
Results and Discussion
Training and Validation Details
...and 9 more sections

Figures (6)

Figure 1: Heterogeneous character number between error word and corresponding correctly spelled word
Figure 2: Sample words that are correctly spelled accidentally, but are context-wise incorrect.
Figure 3: Necessity of understanding existing erroneous words for spelling correction of misspelled words
Figure 4: BSpell architecture details
Figure 5: BERT hybrid pretraining
...and 1 more figure
