The Patrologia Graeca Corpus: OCR, Annotation, and Open Release of Noisy Nineteenth-Century Polytonic Greek Editions

Chahan Vidal-Gorène; Bastien Kindt

TL;DR

The Patrologia Graeca Corpus is presented as the first large-scale open OCR and linguistic resource for nineteenth-century editions of Ancient Greek; it establishes a new benchmark for OCR on noisy polytonic Greek.

Abstract

We present the Patrologia Graeca Corpus, the first large-scale open OCR and linguistic resource for nineteenth-century editions of Ancient Greek. The collection covers the remaining undigitized volumes of the Patrologia Graeca (PG), printed in complex bilingual (Greek-Latin) layouts and characterized by highly degraded polytonic Greek typography. Through a dedicated pipeline combining YOLO-based layout detection and CRNN-based text recognition, we achieve a character error rate (CER) of 1.05% and a word error rate (WER) of 4.69%, substantially outperforming existing OCR systems for polytonic Greek. The resulting corpus contains around six million lemmatized and part-of-speech tagged tokens, aligned with full OCR and layout annotations. Beyond its philological value, this corpus establishes a new benchmark for OCR on noisy polytonic Greek and provides training material for future models, including LLMs.

Paper Structure (17 sections, 6 figures, 3 tables)

Introduction
Previous Initiatives and the Challenges of the Patrologia Graeca
Digital initiatives around the PG
OCR and Editorial Challenges
Related Work in Greek OCR and Document Analysis
Methodology and dataset for PG OCR and analysis
Data Preparation
Model Architecture and Fine-tuning
Results and Corpus Release
OCR and Layout Performance
Corpus Characteristics and Visualization
Output Format and Public Release
Conclusion
Data Availability
Acknowledgements
...and 2 more sections

Figures (6)

Figure 1: Example of page layout in PG with different semantic zones. Green regions correspond to the targeted Greek text.
Figure 2: Typography ambiguity in PG, variation of the character alpha with diacritics. See the section "Results and Corpus Release" for the impact on OCR results.
Figure 3: Overview of the OCR and annotation workflow for the Patrologia Graeca corpus.
Figure 4: Character-level confusion matrix (left) and detailed confusion patterns for iota, omega, epsilon, and omicron showing diacritic variation.
Figure 5: Comparison of semantic and visual distributions of ancient Greek corpora.
...and 1 more figures
