Qalam : A Multimodal LLM for Arabic Optical Character and Handwriting Recognition

Gagan Bhatia; El Moatez Billah Nagoudi; Fakhraddin Alwajih; Muhammad Abdul-Mageed

TL;DR

Qalam tackles the challenging problem of Arabic OCR and handwriting recognition by introducing a vision-encoder/transformer-decoder foundation model built on SwinV2 and RoBERTa. It is trained on a large, diverse corpus including 4.5M manuscript images and 60k synthetic pairs, with explicit handling of diacritics and high-resolution inputs. On the MIDAD benchmark, Qalam achieves state-of-the-art results, with an HWR WER of 0.80% and an OCR WER of 1.18%, and demonstrates strong zero-shot performance on KHATT (CER = 10.43). The work also presents the MIDAD benchmark, analyzes encoder/decoder components, and shows practical implications for Arabic script recognition, while acknowledging domain gaps and diacritic-vocabulary limitations. Overall, Qalam advances Arabic OCR/HWR by combining high-capacity vision and language modeling and offers a scalable foundation for future script-specific recognition systems.

Abstract

Arabic Optical Character Recognition (OCR) and Handwriting Recognition (HWR) pose unique challenges due to the cursive and context-sensitive nature of the Arabic script. This study introduces Qalam, a novel foundation model designed for Arabic OCR and HWR, built on a SwinV2 encoder and RoBERTa decoder architecture. Our model significantly outperforms existing methods, achieving a Word Error Rate (WER) of just 0.80% in HWR tasks and 1.18% in OCR tasks. We train Qalam on a diverse dataset, including over 4.5 million images from Arabic manuscripts and a synthetic dataset comprising 60k image-text pairs. Notably, Qalam demonstrates exceptional handling of Arabic diacritics, a critical feature in Arabic scripts. Furthermore, it shows a remarkable ability to process high-resolution inputs, addressing a common limitation in current OCR systems. These advancements underscore Qalam's potential as a leading solution for Arabic script recognition, offering a significant leap in accuracy and efficiency.
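The WER and CER figures quoted above are edit-distance metrics: the number of word-level (or character-level) substitutions, insertions, and deletions needed to turn the model output into the reference, divided by the reference length. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function names (edit_distance, wer, cer) and the example strings are illustrative assumptions, and the paper may apply text normalization that is not reproduced here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: code-point edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: the hypothesis drops the hamza in the third word.
    reference = "ذهب الولد إلى المدرسة"
    hypothesis = "ذهب الولد الى المدرسة"
    print(f"WER = {wer(reference, hypothesis):.2%}")
    print(f"CER = {cer(reference, hypothesis):.2%}")
```

Because Python strings are sequences of Unicode code points and Arabic diacritics are separate combining characters, this sketch counts a missing or wrong diacritic as one character edit in CER and marks the whole word wrong in WER. Whether the reported numbers treat diacritics the same way is an assumption here.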

Paper Structure (40 sections, 2 equations, 10 figures, 8 tables)

Introduction
Related Works
Handwriting Recognition (HWR).
Optical Character Recognition (OCR).
Multimodal Large Language Models (MLLMs).
Arabic HWR and OCR.
MIDAD Benchmark
Datasets
Data Splits
In the wild Arabic OCR and HWR Datasets
Methods
Encoder Configuration
Decoder Configuration
Baselines
Evaluation Metrics
...and 25 more sections

Figures (10)

Figure 1: An illustrative overview of Qalam's functioning for Arabic OCR and HWR across diverse text types.
Figure 2: An illustrative depiction highlighting the distinctive characteristics of Arabic script that contribute to its increased complexity compared to other languages.
Figure 3: In-the-wild Arabic dataset samples.
Figure 4: A showcase of diverse Arabic script datasets, illustrating the intricate and multifaceted challenges addressed by Qalam: (a) Character-based examples, (b) Word-oriented examples, (c) Line-based examples
Figure 5: Example of a high-resolution image that may present processing difficulties for DeiT. The images are reconstructions produced by both DeiT and SwinV2 following MIM training.
...and 5 more figures
