AraSpell: A Deep Learning Approach for Arabic Spelling Correction

Mahmoud Salhab, Faisal Abu-Khzam

TL;DR

The paper introduces AraSpell, an end-to-end seq2seq framework for Arabic spelling correction, trained on more than 6.9 million sentences with synthetic error injection. It compares three architectures (an attentional vanilla RNN, an attentional stacked RNN, and a Transformer) and finds that the Transformer performs best on a 100K-sentence test set (CER ~1.11% and WER ~4.8% in a mixed-data setup). The approach corrects text at 5% and 10% corruption levels, and mixing data further improves results, which highlights the value of data augmentation for low-resource Arabic NLP tasks. The work offers a scalable, practical framework, developed with an open-source mindset, that is suited to OCR post-processing, search query correction, and Arabic NLU pipelines.
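To make the idea of synthetic error injection concrete, here is a minimal Python sketch. It is not the authors' procedure, which this summary does not describe. It assumes that each non-space character is independently corrupted with probability `rate` by a random deletion, substitution, or insertion. The function name `inject_errors`, the `ARABIC_LETTERS` alphabet, and the choice of operations are all hypothetical.

```python
import random

# Assumption: a basic 28-letter Arabic alphabet, used only for illustration.
ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"


def inject_errors(text, rate=0.05, seed=None):
    """Return a corrupted copy of `text` (hypothetical helper).

    Each non-space character is corrupted with probability `rate`
    by deleting it, substituting it, or inserting a letter after it.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        # Leave spaces alone so word boundaries are preserved.
        if ch == " " or rng.random() >= rate:
            out.append(ch)
            continue
        op = rng.choice(("delete", "substitute", "insert"))
        if op == "delete":
            continue  # drop the character
        if op == "substitute":
            out.append(rng.choice(ARABIC_LETTERS))
        else:  # insert a random letter after the original character
            out.append(ch)
            out.append(rng.choice(ARABIC_LETTERS))
    return "".join(out)


if __name__ == "__main__":
    clean = "اللغة العربية جميلة"
    # The pair (noisy, clean) would serve as one (input, target) training example.
    noisy = inject_errors(clean, rate=0.10, seed=0)
    print(noisy, "->", clean)
```

Under these assumptions, a `rate` of 0.05 or 0.10 would correspond to the 5% and 10% corruption levels mentioned above, though the paper may define those levels differently.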

Abstract

Spelling correction is the task of identifying spelling mistakes, typos, and grammatical mistakes in a given text and correcting them according to their context and grammatical structure. This work introduces "AraSpell," a framework for Arabic spelling correction that uses different seq2seq model architectures, such as Recurrent Neural Network (RNN) and Transformer, with artificial data generation for error injection, trained on more than 6.9 million Arabic sentences. Thorough experimental studies provide empirical evidence of the effectiveness of the proposed approach, which achieved a word error rate (WER) of 4.8% and a character error rate (CER) of 1.11%, in comparison with labeled data of 29.72% WER and 5.03% CER. In a second setting, the approach achieved 10.65% WER and 2.9% CER, in comparison with labeled data of 50.94% WER and 10.02% CER. Both results were obtained on a test set of 100K sentences.
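For readers unfamiliar with the two metrics, the sketch below shows how WER and CER are commonly computed: the Levenshtein edit distance between reference and hypothesis, over words or characters respectively, divided by the reference length. This is the standard textbook definition, not necessarily the paper's exact evaluation code. How the authors normalize text or aggregate over the 100K test sentences is not stated here, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits per reference character."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word error rate: word edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


if __name__ == "__main__":
    ref = "the cat sat down"
    hyp = "the cat sit down"
    print(f"WER = {wer(ref, hyp):.2%}, CER = {cer(ref, hyp):.2%}")
```

A single wrong character makes its whole word count as an error, which is why WER figures are typically several times larger than the corresponding CER figures.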

Paper Structure (14 sections, 11 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: An overview of AraSpell project structure.
  • Figure 2: Attentional vanilla Seq2Seq using RNN.
  • Figure 3: Attentional Seq2Seq with stacked RNN blocks.
  • Figure 4: Attention generated during inference across all heads of the last decoder layer of the transformer model.