HATFormer: Historic Handwritten Arabic Text Recognition with Transformers

Adrian Chan, Anupam Mijar, Mehreen Saeed, Chau-Wai Wong, Akram Khater

TL;DR

HATFormer addresses historical Arabic handwriting recognition under low-resource conditions by adapting a transformer-based English HTR framework with Arabic-specific preprocessing, tokenization, and a synthetic-to-real data strategy. The method combines a BlockProcessor for ViT input preparation, a compact BBPE tokenizer, and a two-stage training regime (synthetic pretraining followed by real-data fine-tuning) to achieve a low character error rate (CER) on both historical and contemporary datasets. Key findings include an 8.6% CER on Muharaf and 4.2% on MADCAT, with ablations and cross-dataset analyses supporting the value of the Arabic-specific design choices and synthetic data. This work enables more effective digitization and searchability of historical Arabic archives, contributing to digital humanities and cultural preservation.
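To illustrate why a compact tokenizer matters for Arabic, the sketch below assumes that BBPE denotes byte-level byte-pair encoding. Each Arabic letter occupies two bytes in UTF-8, so a byte-level vocabulary without learned merges roughly doubles the sequence length. The function `train_bbpe` is a hypothetical toy, not the tokenizer used in the paper: it treats the whole string as one sequence and skips the pre-tokenization that production tokenizers apply.

```python
from collections import Counter


def train_bbpe(text, num_merges):
    """Toy byte-level BPE: repeatedly merge the most frequent adjacent pair.

    Hypothetical helper for illustration only; not the paper's tokenizer.
    """
    # Start from single bytes, so any Unicode text is representable.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would not compress
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


text = "كتاب كتاب كتاب"
raw_length = len(text.encode("utf-8"))
merges, tokens = train_bbpe(text, num_merges=10)
print(raw_length, len(tokens))
```

Because the merges only concatenate adjacent byte strings, joining the tokens always reproduces the original UTF-8 bytes, which is what makes the representation lossless.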

Abstract

Arabic handwritten text recognition (HTR) is challenging, especially for historical texts, due to diverse writing styles and the intrinsic features of Arabic script. Additionally, Arabic handwriting datasets are smaller than English ones, making it difficult to train generalizable Arabic HTR models. To address these challenges, we propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model. By leveraging the transformer's attention mechanism, HATFormer captures spatial contextual information to address the intrinsic challenges of Arabic script through differentiating cursive characters, decomposing visual representations, and identifying diacritics. Our customization to historical handwritten Arabic includes an image processor for effective preprocessing of ViT inputs, a text tokenizer for compact Arabic text representation, and a training pipeline that accounts for the limited amount of historical Arabic handwriting data. HATFormer achieves a character error rate (CER) of 8.6% on the largest public historical handwritten Arabic dataset, a 51% improvement over the best baseline in the literature. HATFormer also attains a comparable CER of 4.2% on the largest private non-historical dataset. Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges, contributing to advancements in document digitization, information retrieval, and cultural preservation.
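The reported metric can be made concrete with a minimal sketch. It assumes the common definition of CER as the Levenshtein (edit) distance between prediction and reference divided by the reference length, counted in Unicode code points. The paper's exact normalization (for example, how diacritics or whitespace are handled) is not stated here, so the function names and the example strings are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One missing letter out of four reference characters.
print(cer("كتاب", "كتب"))
```

Note that combining diacritics are separate code points in Python strings, so under this assumption a missed diacritic counts as one full character error.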

Paper Structure (22 sections, 4 figures, 6 tables)

Figures (4)

  • Figure 1: The architecture of HATFormer. The input text-line image is processed by our BlockProcessor and the BEiT vision transformer. The ground-truth text string is tokenized using our Arabic BBPE tokenizer. The RoBERTa transformer is used for text prediction. HATFormer addresses the three intrinsic challenges of Arabic scripts by leveraging attention and can work on smaller datasets with the help of our synthetic image training pipeline.
  • Figure 2: Top: Our proposed BlockProcessor respects the aspect ratio of (a) an original image and chunks it to fit within (c) a 384$\times$384-pixel ViT image container. In contrast, the base ViT image processor naively resizes images to (d) fully occupy its square image container, resulting in (f) significant horizontal information loss of the vertical strokes when compared to (e) the raw version. Bottom: (g) Synthetic image generation pipeline. Realistic-looking text-line images are generated by randomly selecting words from a large Arabic corpus, rendering with a random font, paper background, and image augmentation.
  • Figure 3: Self- and cross-attention map visualizations. Yellow highlights areas of greater attention, with attention maps overlaid onto the input image for easier comparison. Left: ViT encoder self-attention maps for selected patch tokens. The top of each column shows the relevant patch, followed by attention maps showing what the transformer attends to as it progresses through its subsequent layers. The leftmost column shows the attention for a diacritic patch. Red lines indicate the layer cutoff where the attention association becomes too broad, as identified by our Arabic expert. Right: RoBERTa decoder cross-attention maps for selected ground-truth text tokens. Each row represents consecutive text tokens, read from right to left, with the decoded token string above each map. Tokens are annotated based on their type: red underlines indicate diacritic tokens, green underlines denote subword tokens, and all other tokens correspond to full words, as identified by our Arabic language expert. The attention maps reveal the model's ability to attend to relevant image regions for each token. It can handle a diverse range of text, from small diacritics to complex compounded characters, demonstrating the model's ability to overcome the inherent challenges of Arabic script.
  • Figure 4: (a) The impact of synthetic Stage-1 fine-tuning size on final HTR performance. A larger synthetic Stage-1 fine-tuning dataset allows for better generalization in terms of CER. (b) The effect of inference beam size on the CER and latency of our model on Muharaf. Using a larger beam size leads to a more accurate model but reduced speed. A beam size of three offers a good trade-off between accuracy and computational speed. (c) The impact of the inference length penalty of our model on Muharaf. A length penalty of 0.2 to 0.8 is preferred to achieve the best CER.
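The chunking idea described in the Figure 2 caption can be sketched as follows. This is not the authors' BlockProcessor: the row height, the white padding value, the left-to-right chunk order, and the truncation of overflow are all assumptions made for illustration, and `block_process` is a hypothetical name. The sketch rescales a text line to a fixed row height while preserving its aspect ratio, then wraps the widened line into stacked rows inside a 384×384 container instead of squashing it into a square.

```python
import numpy as np


def block_process(line_img, container=384, row_height=64, pad_value=255):
    """Wrap a wide grayscale text-line image into a square container.

    row_height, pad_value, chunk order, and truncation are illustrative
    assumptions, not values or behavior taken from the paper.
    """
    h, w = line_img.shape
    # Nearest-neighbour resize to row_height, preserving the aspect ratio.
    new_w = max(1, round(w * row_height / h))
    rows = np.arange(row_height) * h // row_height
    cols = np.arange(new_w) * w // new_w
    resized = line_img[rows[:, None], cols[None, :]]

    # Drop whatever would not fit in the available rows (assumption).
    max_rows = container // row_height
    resized = resized[:, : max_rows * container]

    # Paste container-wide chunks one below the other on a blank canvas.
    canvas = np.full((container, container), pad_value, dtype=line_img.dtype)
    for start in range(0, resized.shape[1], container):
        chunk = resized[:, start : start + container]
        top = (start // container) * row_height
        canvas[top : top + row_height, : chunk.shape[1]] = chunk
    return canvas


# A 32x400 all-black line becomes 64x800 and spans three stacked rows.
line = np.zeros((32, 400), dtype=np.uint8)
out = block_process(line)
print(out.shape)
```

In contrast, resizing the same 32×400 line directly to 384×384 would compress its width while stretching its height twelvefold, which is the distortion the caption attributes to the base ViT image processor.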