Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset

Ameer Majeed; Hossein Hassani

TL;DR

The paper addresses the digitization of handwritten East Syriac texts by creating KHAMIS, a dataset of 624 handwritten Syriac sentences, and using it to fine-tune Tesseract's pretrained Syriac model. Training with Tesstrain on KHAMIS, the authors report a low CER on the training set ($1.097$–$1.610\%$) and the evaluation set ($8.963$–$10.498\%$), and substantially better test performance (CER $\approx 18.9$–$19.7\%$, WER $\approx 62.8$–$65.4\%$) than the default Syriac model. This indicates that dataset-driven fine-tuning makes OCR viable for this low-resource language. KHAMIS provides a scalable resource for research and development of Syriac digital services, although the authors acknowledge limitations from dataset size and script coverage. Future work includes expanding to Estrangela and West Syriac, adding diacritics, applying data augmentation, and exploring alternative algorithms. The work thus contributes a practical framework and dataset that advance computational access to Syriac cultural heritage through handwritten OCR, enabling more robust digital humanities analyses and language technology for endangered scripts.

Abstract

Many languages have vast amounts of handwritten texts, such as ancient scripts about folktale stories and historical narratives or contemporary documents and letters. Digitization of those texts has various applications, such as daily tasks, cultural studies, and historical research. Syriac is an ancient, endangered, and low-resourced language that has not received the attention it requires and deserves. This paper reports on a research project aimed at developing an optical character recognition (OCR) model for handwritten Syriac texts as a starting point for building more digital services for this endangered language. A dataset, KHAMIS (named after the East Syriac poet Khamis bar Qardahe), was created; it consists of handwritten sentences in the East Syriac script. We used it to fine-tune the Tesseract-OCR engine's pretrained Syriac model on handwritten data. The data was collected from volunteers capable of reading and writing in the language. KHAMIS currently consists of 624 handwritten Syriac sentences collected from 31 university students and one professor. Part of the dataset will be available online, and the whole dataset will be made available in the near future for development and research purposes. The handwritten OCR model achieved a character error rate of 1.097-1.610% on the training set and 8.963-10.490% on the evaluation set. On the test set, it achieved a character error rate of 18.89-19.71% and a word error rate of 62.83-65.42%, which is about twice as good as the default Syriac model of Tesseract.
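The two metrics reported above can be illustrated with a minimal sketch. It assumes the usual edit-distance definitions of character error rate (CER) and word error rate (WER); the paper's own evaluation procedure is not reproduced here, so the function names and the normalization by reference length are assumptions rather than the authors' implementation. The sample strings are Latin placeholders, not data from KHAMIS.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate in percent, normalized by reference length (assumed)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate in percent, using whitespace tokenization (assumed)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Placeholder strings: one wrong character in a three-word line.
    truth = "the old book"
    ocr_output = "the old bock"
    print(f"CER: {cer(truth, ocr_output):.2f}%")
    print(f"WER: {wer(truth, ocr_output):.2f}%")
```

In this toy case a single wrong character out of 12 gives a CER of about 8.3%, while the WER is about 33.3% because one of the three words is counted as wrong. Under these definitions a word error rate well above the character error rate, as in the test-set figures, is expected.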

Paper Structure (18 sections, 4 equations, 12 figures, 5 tables)

Introduction
Syriac Language and Script - An Overview
Related Works
Analytical Approach
Holistic Approach
Summary
Methodology
Data collection
Preprocessing
Tesseract-OCR
Train/Eval Split
Evaluation Criteria
KHAMIS Dataset
Experiments and Results
Training Results
...and 3 more sections

Figures (12)

Figure 1: East Syriac (Madnḥāyā) Script (Omniglot)
Figure 2: Estrangela Script (Omniglot)
Figure 3: West Syriac (Serṭā) Script (Omniglot)
Figure 4: Sample of a page from the dataset template form
Figure 5: Unclear sentence image
...and 7 more figures
