Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines

Blnd Yaseen; Hossein Hassani

TL;DR

The paper tackles the problem of making historical Kurdish publications processable by training Tesseract v5, starting from its Arabic base model, on a newly created line-image dataset of 1233 lines. It details data collection, preprocessing, and training, and reports an lstmeval CER of $0.755\%$ and an Ocreval character accuracy of $84.02\%$. It culminates in a user-friendly web application for page-image text extraction. The main contributions are the Kurdish historical line-image dataset, the adaptation of an Arabic OCR model to Kurdish, and a practical evaluation showing promising OCR performance on fragile documents with non-standard fonts, which addresses the lack of public Kurdish datasets. This work advances Kurdish digital humanities by enabling scalable processing of historical texts and providing a foundation for further production-ready OCR enhancements and dataset expansion.

Abstract

Kurdish libraries hold many historical publications that were printed in the early days after printing devices were brought to Kurdistan. A good Optical Character Recognition (OCR) system would help process these publications and contribute to Kurdish language resources, which is crucial because Kurdish is considered a low-resource language. Current OCR systems are unable to extract text from historical documents because these documents have many issues: they are damaged, very fragile, carry many marks, and are often written in non-standard fonts. This is a massive obstacle, as processing them currently requires manual typing, which is very time-consuming. In this study, we adopt Tesseract version 5.0, an open-source OCR framework by Google that has been used to extract text for various languages. No public dataset currently exists, so we developed our own by collecting historical documents printed before 1950 from the Zheen Center for Documentation and Research, resulting in a dataset of 1233 line images, each with its transcription. We then used the Arabic model as our base model and trained it on this dataset. We evaluated the model with different methods: Tesseract's built-in evaluator, lstmeval, indicated a Character Error Rate (CER) of 0.755%, and Ocreval reported an average character accuracy of 84.02%. Finally, we developed a web application that provides an easy-to-use interface for end users, allowing them to input an image of a page and extract its text. An extensive dataset is crucial for developing OCR systems with reasonable accuracy, and the absence of public datasets for historical Kurdish documents posed a significant challenge in our work. The unaligned spacing between characters and words proved to be another challenge.
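To make the reported metric concrete, the following is a minimal sketch of how a Character Error Rate is commonly computed: the Levenshtein (edit) distance between a reference transcription and the OCR output, divided by the reference length. It is not the implementation used by lstmeval or Ocreval; the function names `levenshtein` and `cer` and the sample strings are illustrative assumptions. Because the two figures in the abstract come from different tools, they need not sum to 100%.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the processed prefix of
    # `ref` and the first j characters of `hyp`.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical example: the OCR output drops one of eight characters.
reference = "کوردستان"
ocr_output = "کوردستن"
print(f"CER = {cer(reference, ocr_output):.3%}")
```

Under this definition, a character accuracy on the same data would be `1 - cer(reference, ocr_output)`.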

Paper Structure (36 sections, 21 figures, 2 tables)

Introduction
Printing Press in Iraq and Iraqi Kurdistan
Challenges in Historical Documents
Uneven Illumination
Contrast Variation
Bleed-Through Degradation
Faded Ink or Faint Characters
Smear or Show Through
Blur
Thin or Weak Text
Deteriorated Documents
Kurdish Language
Related work
Arabic/Persian
Chinese/Japanese
...and 21 more sections

Figures (21)

Figure 1: A sample page from the book titled 'Deste Gullî Lawane' published in 1939 (Zheen Center for Documentation and Research).
Figure 2: Hussein Huzni's First Press (Dr. Kurdistan Mukryani's Archive)
Figure 3: Most frequently seen degraded defects in historical documents [sulaiman2019degraded]
Figure 4: Uneven illumination in handwritten historical document from Arabic databases [sulaiman2019degraded]
Figure 5: Degraded document image showing variation of contrast [sulaiman2019degraded]
...and 16 more figures
