Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition
Mehreen Saeed, Adrian Chan, Anupam Mijar, Joseph Moukarzel, Georges Habchi, Carlos Younes, Amin Elias, Chau-Wai Wong, Akram Khater
TL;DR
The Muharaf paper addresses the scarcity of publicly available, annotated historical handwritten Arabic manuscripts for handwritten text recognition (HTR) by presenting a large, carefully labeled dataset of 1,644 images with line-level and page-element annotations. It details a two-phase data pipeline: experts first transcribe text lines manually, and then a deep learning model predicts the text and experts correct its output. It also reports a preliminary CNN-based baseline trained on this data that fits within a typical low-resource GPU budget (8 GB). The dataset spans diverse genres and centuries, supporting not only HTR but also tasks such as text-line segmentation and writer/style analysis. Of the 1,644 images, 1,216 are publicly released and 428 are under a proprietary license. The work positions Muharaf within the broader Arabic HTR landscape, compares it to existing Category 2 datasets, and highlights its potential for cross-language adaptation to related scripts such as Urdu, Farsi, and Pashto.
Abstract
We present the Manuscripts of Handwritten Arabic (Muharaf) dataset, a machine learning dataset consisting of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic. Each document image is accompanied by spatial polygonal coordinates of its text lines as well as basic page elements. This dataset was compiled to advance the state of the art in handwritten text recognition (HTR), not only for Arabic manuscripts but also for cursive text in general. The Muharaf dataset includes diverse handwriting styles and a wide range of document types, including personal letters, diaries, notes, poems, church records, and legal correspondence. In this paper, we describe the data acquisition pipeline, notable dataset features, and statistics. We also provide a preliminary baseline result achieved by training convolutional neural networks using this data.
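As an illustration of how polygonal text-line annotations of this kind are commonly consumed, the following minimal Python sketch derives an axis-aligned bounding box and an enclosed area from one line polygon. The record layout (a dictionary with `text` and `polygon` keys), the sample coordinates, and the helper names `polygon_bbox` and `polygon_area` are assumptions made for illustration only; they do not describe the dataset's actual file format or the authors' tooling.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_bbox(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) enclosing a text-line polygon."""
    if not polygon:
        raise ValueError("polygon must contain at least one point")
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def polygon_area(polygon: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    n = len(polygon)
    total = 0.0
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical annotation record: the keys and values are illustrative only.
line = {
    "text": "placeholder transcription",
    "polygon": [(10, 20), (210, 18), (212, 60), (12, 62)],
}

bbox = polygon_bbox(line["polygon"])
area = polygon_area(line["polygon"])
print(bbox, area)
```

A bounding box like this is one simple way to crop a line image for a recognizer, although cropping to the polygon itself would exclude more of the neighboring lines on densely written pages.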
