Qalam: A Multimodal LLM for Arabic Optical Character and Handwriting Recognition
Gagan Bhatia, El Moatez Billah Nagoudi, Fakhraddin Alwajih, Muhammad Abdul-Mageed
TL;DR
Qalam addresses Arabic OCR and handwriting recognition with a vision-encoder/transformer-decoder foundation model that pairs a SwinV2 encoder with a RoBERTa decoder. It is trained on a large, diverse corpus that includes over 4.5M manuscript images and 60k synthetic image-text pairs, with explicit handling of diacritics and high-resolution inputs. On the MIDAD benchmark, Qalam achieves state-of-the-art results, with an HWR WER of 0.80% and an OCR WER of 1.18%, and it shows strong zero-shot performance on KHATT (CER of 10.43). The work also presents the MIDAD benchmark, analyzes the encoder and decoder components, and discusses practical implications for Arabic script recognition, while acknowledging domain gaps and diacritic-vocabulary limitations. Overall, Qalam advances Arabic OCR/HWR by combining high-capacity vision and language modeling, and it offers a scalable foundation for future script-specific recognition systems.
Abstract
Arabic Optical Character Recognition (OCR) and Handwriting Recognition (HWR) pose unique challenges due to the cursive and context-sensitive nature of the Arabic script. This study introduces Qalam, a novel foundation model designed for Arabic OCR and HWR, built on a SwinV2 encoder and RoBERTa decoder architecture. Our model significantly outperforms existing methods, achieving a Word Error Rate (WER) of just 0.80% in HWR tasks and 1.18% in OCR tasks. We train Qalam on a diverse dataset, including over 4.5 million images from Arabic manuscripts and a synthetic dataset comprising 60k image-text pairs. Notably, Qalam demonstrates exceptional handling of Arabic diacritics, a critical feature in Arabic scripts. Furthermore, it shows a remarkable ability to process high-resolution inputs, addressing a common limitation in current OCR systems. These advancements underscore Qalam's potential as a leading solution for Arabic script recognition, offering a significant leap in accuracy and efficiency.
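The WER and CER figures quoted above are edit-distance metrics. As a minimal sketch of how such scores are commonly computed (this is not the authors' evaluation code, and the helper names edit_distance, error_rate, and strip_diacritics are hypothetical), the following Python computes corpus-level error rates as percentages and shows how diacritics affect the character-level score. It assumes whitespace tokenization for words and treats every Unicode nonspacing mark (category Mn) as a diacritic, which covers the Arabic harakat but may differ from the normalization used in the paper.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level WER (unit='word') or CER (unit='char'), in percent."""
    total_edits, total_len = 0, 0
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(r, h)
        total_len += len(r)
    if total_len == 0:
        raise ValueError("references contain no tokens")
    return 100.0 * total_edits / total_len


def strip_diacritics(text):
    """Drop nonspacing marks (assumption: category Mn covers the diacritics)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    ref = ["\u0643\u064E\u062A\u064E\u0628\u064E"]  # fully vocalized "kataba"
    hyp = ["\u0643\u062A\u0628"]                    # same word, no diacritics
    print(error_rate(ref, hyp, unit="char"))
    print(error_rate([strip_diacritics(ref[0])], hyp, unit="char"))
```

In this toy case, a hypothesis that omits all three diacritics is penalized heavily at the character level, while the same pair scores zero once the diacritics are stripped from the reference. This is one reason reported CER values depend on whether diacritics are kept during evaluation.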
