Qalam : A Multimodal LLM for Arabic Optical Character and Handwriting Recognition

Gagan Bhatia, El Moatez Billah Nagoudi, Fakhraddin Alwajih, Muhammad Abdul-Mageed

TL;DR

Qalam tackles the challenging problem of Arabic OCR and handwriting recognition by introducing a vision-encoder/transformer-decoder foundation model built on SwinV2 and RoBERTa. It is trained on a large, diverse corpus including 4.5M manuscript images and 60k synthetic pairs, with explicit handling of diacritics and high-resolution inputs. On the MIDAD benchmark, Qalam achieves state-of-the-art results, with HWR WER of 0.80% and OCR WER of 1.18%, and demonstrates strong zero-shot performance on KHATT (CER = 10.43). The work also presents the MIDAD benchmark, analyzes encoder/decoder components, and shows practical implications for Arabic script recognition, while acknowledging domain gaps and diacritic-vocabulary limitations. Overall, Qalam advances Arabic OCR/HWR by combining high-capacity vision and language modeling and offers a scalable foundation for future script-specific recognition systems.

Abstract

Arabic Optical Character Recognition (OCR) and Handwriting Recognition (HWR) pose unique challenges due to the cursive and context-sensitive nature of the Arabic script. This study introduces Qalam, a novel foundation model designed for Arabic OCR and HWR, built on a SwinV2 encoder and RoBERTa decoder architecture. Our model significantly outperforms existing methods, achieving a Word Error Rate (WER) of just 0.80% in HWR tasks and 1.18% in OCR tasks. We train Qalam on a diverse dataset, including over 4.5 million images from Arabic manuscripts and a synthetic dataset comprising 60k image-text pairs. Notably, Qalam demonstrates exceptional handling of Arabic diacritics, a critical feature in Arabic scripts. Furthermore, it shows a remarkable ability to process high-resolution inputs, addressing a common limitation in current OCR systems. These advancements underscore Qalam's potential as a leading solution for Arabic script recognition, offering a significant leap in accuracy and efficiency.
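
The abstract reports results as Word Error Rate (WER), and the TL;DR also cites a Character Error Rate (CER). As a minimal sketch of how these metrics are conventionally computed (edit distance divided by reference length), the following may help. The function names `edit_distance`, `wer`, and `cer` are illustrative, and the paper's exact tokenization and normalization are not given here, so whitespace word splitting and per-code-point characters are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # Assumption: each Unicode code point counts as one character, so an
    # Arabic diacritic encoded as a combining mark is scored separately.
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Under this convention, a reported WER of 0.80% corresponds to `wer(...)` returning 0.008 when aggregated over a test set. Because Python strings iterate by code point, a missing or wrong diacritic counts as a character error in `cer` and also makes the containing word mismatch in `wer`.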

Paper Structure

This paper contains 40 sections, 2 equations, 10 figures, and 8 tables.

Figures (10)

  • Figure 1: An illustrative overview of Qalam's functioning for Arabic OCR and HWR across diverse text types.
  • Figure 2: An illustrative depiction highlighting the distinctive characteristics of Arabic script that contribute to its increased complexity compared to other languages.
  • Figure 3: In-the-wild Arabic dataset samples.
  • Figure 4: A showcase of diverse Arabic script datasets, illustrating the intricate and multifaceted challenges addressed by Qalam: (a) character-based examples, (b) word-oriented examples, (c) line-based examples.
  • Figure 5: Example of a high-resolution image that may present processing difficulties for DeiT. The images are reconstructions produced by both DeiT and SwinV2 following MIM training.
  • ...and 5 more figures