Cross-Lingual SynthDocs: A Large-Scale Synthetic Corpus for Any to Arabic OCR and Document Understanding

Haneen Al-Homoud, Asma Ibrahim, Murtadha Al-Jubran, Fahad Al-Otaibi, Yazeed Al-Harbi, Daulet Toibazar, Kesen Wang, Pedro J. Moreno

TL;DR

Cross-Lingual SynthDocs tackles the scarcity of Arabic OCR and document understanding resources by introducing a large-scale synthetic corpus that blends realistic page layouts, diacritized Arabic text, and richly annotated tables and charts. The dataset leverages cross-lingual alignment from English resources and layout-preserving translation to generate high-quality Arabic annotations with minimal human labeling. Finetuning Qwen-2.5-VL on SynthDocs yields substantial improvements in OCR metrics (WER/CER) and in structure parsing metrics (TEDS, CharTeX), closely approaching or surpassing state-of-the-art baselines on challenging, style-heavy Arabic data. Overall, SynthDocs provides a scalable, visually realistic resource that advances multilingual document analysis and facilitates broader LVLM benchmarking in Arabic and multilingual contexts.

Abstract

Cross-Lingual SynthDocs is a large-scale synthetic corpus designed to address the scarcity of Arabic resources for Optical Character Recognition (OCR) and Document Understanding (DU). The dataset comprises over 2.5 million samples, including 1.5 million text samples, 270K fully annotated tables, and hundreds of thousands of charts based on real data. Our pipeline leverages authentic scanned backgrounds, bilingual layouts, and diacritic-aware fonts to capture the typographic and structural complexity of Arabic documents. In addition to text, the corpus includes a variety of rendered styles for charts and tables. Finetuning Qwen-2.5-VL on SynthDocs yields consistent improvements in OCR Word Error Rate (WER) and Character Error Rate (CER) across multiple public Arabic benchmarks. Tree-Edit Distance Similarity (TEDS) and Chart Extraction Score (CharTeX) also improve on the other modalities (tables and charts). SynthDocs provides a scalable, visually realistic resource for advancing research in multilingual document analysis.
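For readers unfamiliar with the OCR metrics named above, the following is a minimal sketch of how WER and CER are conventionally computed: the Levenshtein edit distance between reference and prediction, divided by the reference length in words or characters. The function names (`edit_distance`, `wer`, `cer`) are illustrative, and the sketch applies no text normalization; whether the authors normalize diacritics, whitespace, or punctuation before scoring is not stated here, so this should not be read as their evaluation code.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because Python strings are sequences of code points, Arabic diacritics count as separate characters in this sketch, so a dropped diacritic raises CER even when the base letters match. Lower values are better for both metrics, and both can exceed 1.0 when the prediction is much longer than the reference.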

Paper Structure

This paper contains 25 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Examples of synthetic tables: (a) a consistent style table with uniform formatting, (b) a random style table with varied fonts and content, and (c) a table generated from the ArXiv synthetic subset.
  • Figure 2: Examples of synthetically generated charts: (a) Dual-axis chart, (b) Heatmap, (c) Area chart, and (d) Doughnut chart.