A State-of-the-Art Morphosyntactic Parser and Lemmatizer for Ancient Greek

Giuseppe G. A. Celano

TL;DR

The results of the experiment suggest that token embeddings are not sufficient to achieve high UAS and LAS scores unless they are coupled with a modeling strategy specifically designed to capture syntactic relationships.

Abstract

This paper presents an experiment comparing six models to identify a state-of-the-art morphosyntactic parser and lemmatizer for Ancient Greek capable of annotating according to the Ancient Greek Dependency Treebank annotation scheme. A normalized version of the major collections of annotated texts was used to (i) train the baseline model Dithrax with randomly initialized character embeddings and (ii) fine-tune Trankit and four recent models pretrained on Ancient Greek texts, i.e., GreBERTa and PhilBERTa for morphosyntactic annotation and GreTa and PhilTa for lemmatization. A Bayesian analysis shows that Dithrax and Trankit annotate morphology practically equivalently, while syntax is best annotated by Trankit and lemmata by GreTa. The results of the experiment suggest that token embeddings are not sufficient to achieve high UAS and LAS scores unless they are coupled with a modeling strategy specifically designed to capture syntactic relationships. The dataset and best-performing models are made available online for reuse.
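For readers unfamiliar with the two syntactic metrics named in the abstract, the following is a minimal sketch of how UAS (unlabeled attachment score) and LAS (labeled attachment score) are conventionally computed. The function name `attachment_scores`, the tuple layout, and the toy sentence are assumptions made for illustration; they are not taken from the paper's evaluation code.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for one or more sentences flattened into token lists.

    gold, pred: lists of (head_index, relation_label) tuples, one per token.
    UAS counts tokens whose predicted head is correct;
    LAS additionally requires the relation label to be correct.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of tokens")
    if not gold:
        raise ValueError("at least one token is required")
    total = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return uas_hits / total, las_hits / total


# Hypothetical four-token sentence; head 0 marks the root.
gold = [(2, "SBJ"), (0, "PRED"), (4, "ATR"), (2, "OBJ")]
pred = [(2, "SBJ"), (0, "PRED"), (2, "ATR"), (2, "ADV")]

uas, las = attachment_scores(gold, pred)
print(uas, las)  # 0.75 0.5
```

In the toy example the third token has the wrong head (it counts against both scores) and the fourth has the right head but the wrong label (it counts against LAS only), which is why LAS can never exceed UAS.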

Paper Structure

This paper contains 14 sections, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Main layers of Dithrax, the baseline model architecture. Blue stands for tanh(linear(x)) and orange for softmax(linear(x)); $\times$ denotes the dot product and $+$ concatenation.
  • Figure 2: Posteriors of the Bayesian correlated t-test for all model pairs with reference to POS scores.
  • Figure 3: Posteriors of the Bayesian correlated t-test for all model pairs with reference to XPOS scores.
  • Figure 4: Posteriors of the Bayesian correlated t-test for all model pairs with reference to Feats scores.
  • Figure 5: Posteriors of the Bayesian correlated t-test for all model pairs with reference to AllTags scores.
  • ...and 3 more figures
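The posteriors referred to in Figures 2 to 5 come from a Bayesian correlated t-test. As a rough sketch of the general technique, the code below computes the posterior probabilities that the mean score difference between two models lies below, inside, or above a region of practical equivalence (ROPE). The posterior is a Student t distribution with n-1 degrees of freedom whose variance is inflated by a correlation term `rho`. The values of `rho` and `rope`, the score differences, and all function names are assumptions for illustration; the settings actually used in the paper are not stated in this section.

```python
import math


def student_t_cdf(x, df, steps=2000):
    """CDF of a standard Student t distribution, integrated with Simpson's rule."""
    if x == 0:
        return 0.5
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(t):
        return norm * (1 + t * t / df) ** (-(df + 1) / 2)

    h = x / steps  # signed step, so negative x yields a negative integral
    total = pdf(0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return min(1.0, max(0.0, 0.5 + total * h / 3))


def correlated_ttest(diffs, rho, rope):
    """Posterior probabilities (left, rope, right) for the mean of `diffs`.

    diffs: per-run score differences (model A minus model B).
    rho:   assumed correlation between runs (e.g. test-set share of the data).
    rope:  half-width of the region of practical equivalence.
    """
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    scale = math.sqrt(var * (1 / n + rho / (1 - rho)))

    def cdf(v):
        return student_t_cdf((v - mean) / scale, n - 1)

    p_left = cdf(-rope)
    p_rope = cdf(rope) - cdf(-rope)
    p_right = 1 - cdf(rope)
    return p_left, p_rope, p_right


# Hypothetical differences in a score between two models over ten runs.
diffs = [0.021, 0.018, 0.025, 0.019, 0.022, 0.020, 0.024, 0.017, 0.023, 0.021]
p_left, p_rope, p_right = correlated_ttest(diffs, rho=0.1, rope=0.01)
print(p_left, p_rope, p_right)
```

Two models are called practically equivalent when most of the posterior mass falls inside the ROPE; in the made-up data above nearly all of it falls to the right, which would favor model A.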