Classifier identification in Ancient Egyptian as a low-resource sequence-labelling task

Dmitry Nikolaev, Jorke Grotenhuis, Haleli Harel, Orly Goldwasser

TL;DR

This work treats the identification of graphemic classifiers (determinatives) in Ancient Egyptian as a low-resource sequence-labelling task, leveraging iClassifier-annotated data from the Coffin Texts corpus. It evaluates three sequence-to-sequence model families (LSTM-char, LSTM-sign, and ByT5-small) across two tokenisation regimes and a binary CLF-label output formulation, benchmarking against simple baselines. Results show ByT5-small achieving the strongest in-domain performance ($\text{dev}=0.08$, $\text{test}=0.10$ misclassified signs per data point) and good out-of-domain performance ($0.35$), with sign-based LSTMs also competitive, while character-based models lag. The study highlights generalisation challenges due to limited training data and diachronic/genre variation, and points to future work on a finer-grained CLF taxonomy (semantic vs grammatical vs phonetic) and generative modelling within the iClassifier framework to support cross-script analyses.
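To make the reported figures easier to read, the sketch below illustrates one plausible reading of the binary CLF-label formulation and of "misclassified signs per data point". It assumes that each data point is a word represented as a sequence of signs, that each sign carries a label of 1 (classifier) or 0 (not a classifier), and that the score is the number of label mismatches averaged over data points. The function names and the toy label sequences are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def count_misclassified(gold: Sequence[int], pred: Sequence[int]) -> int:
    """Count the signs whose predicted CLF label differs from the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted label sequences must have equal length")
    return sum(g != p for g, p in zip(gold, pred))


def mean_errors_per_data_point(
    gold_batch: Sequence[Sequence[int]],
    pred_batch: Sequence[Sequence[int]],
) -> float:
    """Average the number of misclassified signs over all data points.

    Assumption: this is how 'misclassified signs per data point' is computed;
    the paper may define the metric differently.
    """
    if len(gold_batch) != len(pred_batch):
        raise ValueError("batches must contain the same number of data points")
    if not gold_batch:
        raise ValueError("at least one data point is required")
    total = sum(count_misclassified(g, p) for g, p in zip(gold_batch, pred_batch))
    return total / len(gold_batch)


# Hypothetical toy data: 1 marks a classifier sign, 0 marks any other sign.
# The first word has five signs, the last two of which are classifiers.
gold_batch = [[0, 0, 0, 1, 1], [0, 0, 1]]
pred_batch = [[0, 0, 0, 1, 0], [0, 0, 1]]  # one classifier missed in word 1

score = mean_errors_per_data_point(gold_batch, pred_batch)
print(score)  # 0.5 (one wrong sign across two data points)
```

Under this reading, a score of 0.10 would correspond to roughly one mislabelled sign in every ten data points.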

Abstract

The complex Ancient Egyptian (AE) writing system was characterised by widespread use of graphemic classifiers (determinatives): silent (unpronounced) hieroglyphic signs clarifying the meaning or indicating the pronunciation of the host word. The study of classifiers has intensified in recent years with the launch and quick growth of the iClassifier project, a web-based platform for annotation and analysis of classifiers in ancient and modern languages. Thanks to the data contributed by the project participants, it is now possible to formulate the identification of classifiers in AE texts as an NLP task. In this paper, we make first steps towards solving this task by implementing a series of sequence-labelling neural models, which achieve promising performance despite the modest amount of training data. We discuss tokenisation and operationalisation issues arising from tackling AE texts and contrast our approach with frequency-based baselines.

Paper Structure (14 sections, 1 figure, 2 tables)

Figures (1)

  • Figure 1: A form of the verb trr 'to race' represented in hieroglyphs and in the Manuel de Codage transcription. The last two signs are unpronounced semantic classifiers putting 'race' in the [MOVEMENT] category.