Bridging the Gap: Fusing CNNs and Transformers to Decode the Elegance of Handwritten Arabic Script

Chaouki Boufenar, Mehdi Ayoub Rabiai, Boualem Nadjib Zahaf, Khelil Rafik Ouaras

TL;DR

This work tackles handwritten Arabic script recognition by fusing CNNs and Transformers to capture both local structural details and global contextual relationships. It introduces a multi-architecture framework comprising a custom CNN, EfficientNet-B7, a custom Vision Transformer, and ViT-B16, followed by a confidence-based fusion ensemble that leverages each model's strengths. On the IFN/ENIT dataset, the ensemble achieves 96.38% letter accuracy and 97.22% positional accuracy, outperforming each individual model and demonstrating robustness to intra-class variability. The approach offers a scalable OCR solution for real-world Arabic handwriting, and future work may extend it to other cursive scripts and incorporate stroke-level cues.

Abstract

Handwritten Arabic script recognition is a challenging task due to the script's dynamic letter forms and contextual variations. This paper proposes a hybrid approach combining convolutional neural networks (CNNs) and Transformer-based architectures to address these complexities. We evaluate custom and fine-tuned models, including EfficientNet-B7 and Vision Transformer (ViT-B16), and introduce an ensemble model that uses confidence-based fusion to integrate their strengths. On the IFN/ENIT dataset, our ensemble achieves 96.38% accuracy for letter classification and 97.22% for positional classification. The results highlight the complementary nature of CNNs and Transformers, demonstrating their combined potential for robust Arabic handwriting recognition. This work advances OCR systems, offering a scalable solution for real-world applications.
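
The abstract does not spell out the fusion rule, so the following is only a minimal sketch of one common form of confidence-based fusion: each model outputs class probabilities, and the ensemble keeps the prediction of whichever model is most confident for that sample. The model names, the logits, the three-class setup, and the helper names `softmax` and `confidence_fusion` are all hypothetical and chosen for illustration; the authors' actual rule may differ (for example, it could weight or average the outputs instead).

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_fusion(model_probs):
    """Pick the prediction of the most confident model.

    model_probs maps a model name to its list of class probabilities.
    Returns (model_name, class_index, confidence).
    Assumption: "confidence" is the maximum softmax probability.
    """
    best = None
    for name, probs in model_probs.items():
        cls = max(range(len(probs)), key=probs.__getitem__)
        conf = probs[cls]
        if best is None or conf > best[2]:
            best = (name, cls, conf)
    return best


# Hypothetical logits from four models for one letter image (3 classes).
logits = {
    "custom_cnn": [2.0, 1.0, 0.1],
    "efficientnet_b7": [0.5, 3.0, 0.2],
    "custom_vit": [1.0, 1.2, 0.9],
    "vit_b16": [0.3, 2.5, 0.4],
}
probs = {name: softmax(z) for name, z in logits.items()}
winner, label, confidence = confidence_fusion(probs)
print(f"{winner} predicts class {label} with confidence {confidence:.3f}")
```

In this toy input the most confident model determines the ensemble's label, even though another model disagrees. A rule of this kind lets a CNN decide on samples where local stroke detail is decisive and a Transformer decide where global shape matters more, which is the complementarity the abstract points to.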

Paper Structure

This paper contains 42 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Example of a word image from the IFN/ENIT dataset.
  • Figure 2: Example of the segmented letters from the IFN/ENIT dataset.
  • Figure 3: Examples of each applied transformation.
  • Figure 4: Illustration of the CNN architecture and pipeline.
  • Figure 5: Illustration of the EfficientNet-B7 architecture and pipeline.
  • ...and 3 more figures