Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction

Mohamed Rashad

TL;DR

A large-scale dataset of 1.1 billion Arabic tokens, extracted from over 8,500 books using the best-performing Arabic-Nougat model, is released as a resource for Arabic OCR research.

Abstract

We present Arabic-Nougat, a suite of OCR models for converting Arabic book pages into structured Markdown text. Based on Meta's Nougat architecture, Arabic-Nougat includes three specialized models: arabic-small-nougat, arabic-base-nougat, and arabic-large-nougat. These models are fine-tuned on a synthetic dataset, arabic-img2md, comprising 13.7k pairs of Arabic book pages and their Markdown representations. Key contributions include the Aranizer-PBE-86k tokenizer, designed for efficient tokenization, and the use of torch.bfloat16 precision with Flash Attention 2 for optimized training and inference. Our models achieve state-of-the-art performance, with arabic-large-nougat delivering the highest Markdown Structure Accuracy and the lowest Character Error Rate. Additionally, we release a large-scale dataset containing 1.1 billion Arabic tokens extracted from over 8,500 books using our best-performing model, providing a valuable resource for Arabic OCR research. All models, datasets, and code are open-sourced and available at https://github.com/MohamedAliRashad/arabic-nougat.
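The abstract reports Character Error Rate (CER) without defining it. As a minimal sketch, assuming the standard definition (character-level Levenshtein distance divided by the reference length), the metric could be computed as follows. The function names `levenshtein` and `character_error_rate` are illustrative and are not taken from the Arabic-Nougat code base; the paper's own evaluation script may normalize text differently.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `reference` into `hypothesis` (row-by-row dynamic programming)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete ref_char
                    current[j - 1] + 1,                   # insert hyp_char
                    previous[j - 1] + substitution_cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference.

    Assumption: no Unicode normalization or diacritic stripping is applied;
    a real Arabic OCR evaluation may want to normalize both strings first.
    """
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical OCR output that dropped one letter of a four-letter word.
    print(character_error_rate("كتاب", "كتب"))
```

A lower value is better, and the value can exceed 1.0 when the hypothesis is much longer than the reference.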

Paper Structure

This paper contains 20 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Overview of the Arabic-Nougat architecture, illustrating the integration of the Donut Vision Encoder with an auto-regressive MBART decoder for Arabic OCR and Markdown extraction. The diagram highlights key components such as image encoding from an Arabic book page and the overall decoding process.