Opera Graeca Adnotata: Building a 34M+ Token Multilayer Corpus for Ancient Greek

Giuseppe G. A. Celano

TL;DR

OGA beta 0.1.0 is the largest open-access multilayer corpus of Ancient Greek to date, compiling 1,687 CTS-compliant texts and over 34 million tokens from PerseusDL and OpenGreekAndLatin. It provides seven independent annotation layers, stored in PAULA XML and LAULA XML to maximize scalability and reuse. Morphosyntactic labeling is produced by the COMBO parser trained on AGDT 2.1, while tokenization and CTS tagging are rule-based and guided by the TEI/EpiDoc sources. The work analyzes encoding normalization and the trade-offs of standoff formats for large-scale corpora, and releases resources on Zenodo to support reproducible research in philology and computational linguistics. The corpus enables cross-layer linguistic analysis of Ancient Greek and establishes a scalable framework for future annotation layers, such as IPA transcription.

Abstract

This article presents beta version 0.1.0 of Opera Graeca Adnotata (OGA), the largest open-access multilayer corpus for Ancient Greek (AG). OGA consists of 1,687 literary works and 34M+ tokens drawn from the PerseusDL and OpenGreekAndLatin GitHub repositories, which host AG texts ranging from about 800 BCE to about 250 CE. The texts have been enriched with seven annotation layers: (i) tokenization layer; (ii) sentence segmentation layer; (iii) lemmatization layer; (iv) morphological layer; (v) dependency layer; (vi) dependency function layer; (vii) Canonical Text Services (CTS) citation layer. The creation of each layer is described, with emphasis on the main technical and annotation-related issues encountered. Tokenization, sentence segmentation, and CTS citation are performed by rule-based algorithms, while morphosyntactic annotation is the output of the COMBO parser trained on the data of the Ancient Greek Dependency Treebank. For the sake of scalability and reusability, the corpus is released in the standoff formats PAULA XML and its derivative LAULA XML.
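
The idea behind standoff annotation, in which each layer lives apart from the text and points back to shared token identifiers, can be illustrated with a minimal Python sketch. It does not reproduce the PAULA XML or LAULA XML schemas: the names `Token`, `tokenize`, `resolve`, and `merge_layers`, the `tok_N` identifier scheme, the whitespace tokenizer, and the annotation values are all assumptions made for illustration only.

```python
from dataclasses import dataclass

# Sample string chosen for illustration; not read from the OGA files.
BASE_TEXT = "μῆνιν ἄειδε θεά"


@dataclass(frozen=True)
class Token:
    token_id: str  # hypothetical identifier scheme
    start: int     # inclusive character offset into the base text
    end: int       # exclusive character offset into the base text


def tokenize(text: str) -> list[Token]:
    """Naive whitespace tokenizer that stores offsets, not copies of the text."""
    tokens = []
    pos = 0
    for i, word in enumerate(text.split(), start=1):
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append(Token(f"tok_{i}", start, end))
        pos = end
    return tokens


TOKENS = tokenize(BASE_TEXT)

# Each layer is an independent mapping keyed by token ID, so a layer can be
# added, replaced, or dropped without touching the text or the other layers.
# The values below are illustrative, not taken from the corpus.
LEMMA_LAYER = {"tok_1": "μῆνις", "tok_2": "ἀείδω", "tok_3": "θεά"}
HEAD_LAYER = {"tok_1": "tok_2", "tok_2": None, "tok_3": "tok_2"}  # None = root
FUNCTION_LAYER = {"tok_1": "OBJ", "tok_2": "PRED", "tok_3": "ExD"}


def resolve(token: Token, text: str) -> str:
    """Recover a token's surface form from its offsets."""
    return text[token.start:token.end]


def merge_layers(tokens, text, **layers):
    """Join any number of layers on token ID into one row per token."""
    rows = []
    for tok in tokens:
        row = {"id": tok.token_id, "form": resolve(tok, text)}
        for name, layer in layers.items():
            row[name] = layer.get(tok.token_id)
        rows.append(row)
    return rows


if __name__ == "__main__":
    merged = merge_layers(
        TOKENS, BASE_TEXT,
        lemma=LEMMA_LAYER, head=HEAD_LAYER, function=FUNCTION_LAYER,
    )
    for row in merged:
        print(row)
```

Because the layers are joined only at read time, a consumer interested in lemmas alone can load the token and lemma mappings and ignore the rest, which is the kind of selective reuse that standoff formats are meant to support.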

Paper Structure

This paper contains 11 sections, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Logic of standoff annotation layers in PAULA XML
  • Figure 2: Architecture of Standoff PAULA XML for OGA