eFontes. Part of Speech Tagging and Lemmatization of Medieval Latin Texts. A Cross-Genre Survey
Krzysztof Nowak, Jędrzej Ziębura, Krzysztof Wróbel, Aleksander Smywiński-Pohl
TL;DR
This work introduces eFontes, a family of transformer-based models for automatic annotation of Medieval Latin texts, covering lemmatization, POS tagging, and morphological feature tagging. The models are trained on UD corpora and a new Polish Medieval Latin resource. Across multiple training scenarios, the study shows that domain adaptation (fine-tuning on UD data first, then on eFontes domain data) yields the strongest performance, with high accuracy for lemmatization, POS tagging, and morphological features across several genres. A detailed qualitative error analysis identifies orthographic variation and Latinized vernacular terms as the primary error sources, which informs future data harmonization and model improvements. The authors plan to extend the work to Named Entity Recognition and to broaden genre coverage. They stress that high-quality annotated corpora are essential for historical-language processing and for the downstream research that depends on it.
Abstract
This study introduces the eFontes models for automatic linguistic annotation of Medieval Latin texts, focusing on lemmatization, part-of-speech tagging, and morphological feature determination. The models were built with the Transformers library and trained on Universal Dependencies (UD) corpora and the newly developed eFontes corpus of Polish Medieval Latin. The research evaluates the models' performance and addresses challenges such as orthographic variation and the integration of Latinized vernacular terms. The models achieved the following accuracy rates: lemmatization at 92.60%, part-of-speech tagging at 83.29%, and morphological feature determination at 88.57%. The findings underscore the importance of high-quality annotated corpora and propose future enhancements, including extending the models to Named Entity Recognition.
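For readers who want a concrete picture of what the reported accuracy figures measure, the following is a minimal sketch of token-level accuracy computed separately for lemmas, POS tags, and morphological features. It rests on assumptions not stated in the abstract: that accuracy is the share of tokens whose predicted value exactly matches the gold value, and that morphological features are compared as one full UD-style string per token. The function name `token_accuracy`, the field names, and the toy annotations are hypothetical illustrations, not the authors' evaluation code or data.

```python
from typing import Dict, List

# One token is a mapping from annotation layer to value (hypothetical layout,
# loosely modelled on the CoNLL-U columns LEMMA, UPOS and FEATS).
Token = Dict[str, str]


def token_accuracy(gold: List[Token], pred: List[Token], field: str) -> float:
    """Return the fraction of tokens whose `field` value matches the gold value."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be aligned token by token")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(1 for g, p in zip(gold, pred) if g[field] == p[field])
    return correct / len(gold)


# Toy gold annotation for "In nomine Domini amen" (illustrative only).
gold = [
    {"form": "In", "lemma": "in", "upos": "ADP", "feats": "_"},
    {"form": "nomine", "lemma": "nomen", "upos": "NOUN",
     "feats": "Case=Abl|Gender=Neut|Number=Sing"},
    {"form": "Domini", "lemma": "dominus", "upos": "NOUN",
     "feats": "Case=Gen|Gender=Masc|Number=Sing"},
    {"form": "amen", "lemma": "amen", "upos": "INTJ", "feats": "_"},
]

# Toy system output with one POS error and one feature error (invented).
pred = [
    {"form": "In", "lemma": "in", "upos": "ADP", "feats": "_"},
    {"form": "nomine", "lemma": "nomen", "upos": "NOUN",
     "feats": "Case=Nom|Gender=Neut|Number=Sing"},   # wrong case
    {"form": "Domini", "lemma": "dominus", "upos": "PROPN",  # wrong POS
     "feats": "Case=Gen|Gender=Masc|Number=Sing"},
    {"form": "amen", "lemma": "amen", "upos": "INTJ", "feats": "_"},
]

scores = {field: token_accuracy(gold, pred, field)
          for field in ("lemma", "upos", "feats")}
print(scores)
```

Scoring each layer independently, as here, means one token can count as correct for lemmatization and wrong for POS tagging, which is why the three figures can differ as much as they do in the abstract.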
