Bridging the Gap: Fusing CNNs and Transformers to Decode the Elegance of Handwritten Arabic Script
Chaouki Boufenar, Mehdi Ayoub Rabiai, Boualem Nadjib Zahaf, Khelil Rafik Ouaras
TL;DR
This work tackles handwritten Arabic script recognition by fusing CNNs and Transformers to capture both local structural details and global contextual relationships. It introduces a multi-architecture framework comprising a custom CNN, EfficientNet-B7, a custom Vision Transformer, and ViT-B16, followed by a confidence-based fusion ensemble that leverages each model's strengths. On the IFN/ENIT dataset, the ensemble achieves 96.38% letter accuracy and 97.22% positional accuracy, outperforming each individual model and demonstrating robustness to intra-class variability. The approach offers a scalable OCR solution for real-world Arabic handwriting, with implications for extending to other cursive scripts and incorporating stroke-level cues in future work.
Abstract
Handwritten Arabic script recognition is challenging because of the script's dynamic letter forms and contextual variations. This paper proposes a hybrid approach that combines convolutional neural networks (CNNs) and Transformer-based architectures to address these complexities. We evaluate custom and fine-tuned models, including EfficientNet-B7 and Vision Transformer (ViT-B16), and introduce an ensemble model that uses confidence-based fusion to integrate their strengths. On the IFN/ENIT dataset, the ensemble achieves 96.38% accuracy for letter classification and 97.22% for positional classification. The results highlight the complementary nature of CNNs and Transformers and demonstrate their combined potential for robust Arabic handwriting recognition. This work advances OCR systems, offering a scalable solution for real-world applications.
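The abstract does not spell out the fusion rule, so the following is only an illustrative sketch of one common form of confidence-based fusion: for each image, keep the prediction of whichever model assigns the highest probability to its own top class. The function name fuse_by_confidence, the model keys, and the probability values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def fuse_by_confidence(model_probs: Dict[str, List[float]]) -> Tuple[str, int, float]:
    """Return (model name, class index, confidence) of the most confident model.

    model_probs maps a model name to its class-probability vector for ONE image.
    Assumption: "confidence" is the largest probability in that vector.
    Ties are resolved in favor of the model listed first.
    """
    if not model_probs:
        raise ValueError("need at least one model")
    best = None
    for name, probs in model_probs.items():
        conf = max(probs)        # this model's confidence in its top class
        cls = probs.index(conf)  # index of that top class
        if best is None or conf > best[2]:
            best = (name, cls, conf)
    return best


# Made-up probabilities over three classes for a single image.
example = {
    "custom_cnn": [0.10, 0.70, 0.20],
    "efficientnet_b7": [0.05, 0.15, 0.80],
    "custom_vit": [0.40, 0.35, 0.25],
    "vit_b16": [0.30, 0.60, 0.10],
}
print(fuse_by_confidence(example))
```

In this toy input the EfficientNet-B7 vector has the largest peak, so its class is selected. Other schemes, such as averaging or confidence-weighted voting, would also fit the description in the abstract.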
