AraSpell: A Deep Learning Approach for Arabic Spelling Correction
Mahmoud Salhab, Faisal Abu-Khzam
TL;DR
The paper tackles Arabic spelling correction by introducing AraSpell, an end-to-end seq2seq framework trained on a large corpus (over 6.9 million sentences) with synthetic error injection. It compares an attentional vanilla RNN, an attentional stacked RNN, and a Transformer, and finds that the Transformer performs best on a 100K-sentence test set (1.11% CER and 4.8% WER in a mixed-data setup). The approach corrects text well at both corruption levels studied (5% and 10%), and mixing data from the two levels improves results further, which highlights the value of data augmentation for low-resource Arabic NLP tasks. The framework is scalable and practical, with potential uses in OCR post-processing, search query correction, and Arabic NLU pipelines.
Abstract
Spelling correction is the task of identifying spelling mistakes, typos, and grammatical mistakes in a given text and correcting them according to their context and grammatical structure. This work introduces "AraSpell," a framework for Arabic spelling correction that uses different seq2seq model architectures, such as Recurrent Neural Network (RNN) and Transformer, together with artificial data generation for error injection, trained on more than 6.9 million Arabic sentences. Thorough experimental studies provide empirical evidence of the effectiveness of the proposed approach, which achieved a word error rate (WER) of 4.8% and a character error rate (CER) of 1.11%, compared with 29.72% WER and 5.03% CER for the labeled data. In a second setting, the approach achieved 10.65% WER and 2.9% CER, compared with 50.94% WER and 10.02% CER for the labeled data. Both results were obtained on a test set of 100K sentences.
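For readers unfamiliar with the two metrics, the sketch below shows how WER and CER are commonly computed from Levenshtein edit distance: the number of edits needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The function names are hypothetical, and the per-sentence normalization is an assumption; the abstract does not state how the paper aggregates scores over the test set.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to match the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference length in characters."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / reference length in words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "ذهب الولد الى المدرسة"
hypothesis = "ذهب الولد الي المدرسة"  # one misspelled word
print(f"WER = {wer(reference, hypothesis):.2%}, CER = {cer(reference, hypothesis):.2%}")
```

Under this definition, a single wrong character makes the whole word count as an error, which is why WER figures such as those in the abstract are several times larger than the corresponding CER figures.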
