The Patrologia Graeca Corpus: OCR, Annotation, and Open Release of Noisy Nineteenth-Century Polytonic Greek Editions

Chahan Vidal-Gorène, Bastien Kindt

TL;DR

This paper presents the Patrologia Graeca Corpus, the first large-scale open OCR and linguistic resource for nineteenth-century editions of Ancient Greek, and establishes a new benchmark for OCR on noisy polytonic Greek.

Abstract

We present the Patrologia Graeca Corpus, the first large-scale open OCR and linguistic resource for nineteenth-century editions of Ancient Greek. The collection covers the remaining undigitized volumes of the Patrologia Graeca (PG), which are printed in complex bilingual (Greek-Latin) layouts and characterized by highly degraded polytonic Greek typography. Using a dedicated pipeline that combines YOLO-based layout detection with CRNN-based text recognition, we achieve a character error rate (CER) of 1.05% and a word error rate (WER) of 4.69%, substantially outperforming existing OCR systems for polytonic Greek. The resulting corpus contains around six million lemmatized and part-of-speech tagged tokens, aligned with full OCR and layout annotations. Beyond its philological value, this corpus establishes a new benchmark for OCR on noisy polytonic Greek and provides training material for future models, including LLMs.
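For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how CER and WER are conventionally computed, as Levenshtein edit distance divided by the reference length. It is not the authors' evaluation code: the function names, the NFC normalization step, and the whitespace tokenization are assumptions made for illustration, and the paper's exact protocol may differ.

```python
import unicodedata
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance; insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _nfc(text: str) -> str:
    # Assumption: compare in NFC so that precomposed and decomposed
    # polytonic characters (letter + combining diacritics) count as equal.
    return unicodedata.normalize("NFC", text)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference length."""
    ref, hyp = _nfc(reference), _nfc(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / reference word count (whitespace tokens)."""
    ref, hyp = _nfc(reference).split(), _nfc(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One missing accent: 1 of 5 characters wrong, 1 of 5 words wrong.
    print(cer("λόγος", "λογος"))
    print(wer("ἐν ἀρχῇ ἦν ὁ λόγος", "ἐν ἀρχῇ ἦν ὁ λογος"))
```

Under this definition a single wrong diacritic counts as one character error but invalidates the whole word, which is one reason WER is typically several times higher than CER on polytonic text.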

Paper Structure (17 sections, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Example of page layout in PG with different semantic zones. Green regions correspond to the targeted Greek text.
  • Figure 2: Typographic ambiguity in PG: variation of the character alpha with diacritics. See the section on results and corpus release for the impact on OCR results.
  • Figure 3: Overview of the OCR and annotation workflow for the Patrologia Graeca corpus.
  • Figure 4: Character-level confusion matrix (left) and detailed confusion patterns for iota, omega, epsilon, and omicron showing diacritic variation
  • Figure 5: Comparison of semantic and visual distributions of ancient Greek corpora.
  • ...and 1 more figures