Dependency Annotation of Ottoman Turkish with Multilingual BERT
Şaziye Betül Özateş, Tarık Emre Tıraş, Efe Eren Genç, Esma Fatıma Bilgin Taşdemir
TL;DR
This work tackles the scarcity of annotated Ottoman Turkish data by proposing an iterative, human-in-the-loop annotation pipeline: a multilingual BERT-based parser generates pseudo-annotations, which are then manually corrected and used to fine-tune the model. The authors introduce OTA-BOUN, the first UD-style Ottoman Turkish treebank, with data from seven late-Ottoman texts in two scripts, and establish an annotation scheme with strong inter-annotator agreement. Over two iterative batches, they show that fine-tuning on Ottoman Turkish data improves unlabeled attachment, while relation labeling remains challenging due to Persian-influenced syntax, loanwords, and longer sentences. The approach demonstrates a practical path to building historical language resources under UD, enabling automated analysis of Ottoman Turkish documents and broader historical NLP research.
Abstract
This study introduces a pretrained large language model-based annotation methodology for the first dependency treebank in Ottoman Turkish. Our experimental results show that by iteratively i) pseudo-annotating data using a multilingual BERT-based parsing model, ii) manually correcting the pseudo-annotations, and iii) fine-tuning the parsing model with the corrected annotations, we speed up and simplify the challenging dependency annotation process. The resulting treebank, which will be part of the Universal Dependencies (UD) project, will facilitate automated analysis of Ottoman Turkish documents, unlocking the linguistic richness embedded in this historical heritage.
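The three-step loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Tree`, `annotate_iteratively`, `parse`, `correct`, and `fine_tune` are hypothetical, the parser and the human correction step are passed in as plain callables, and fine-tuning on all corrected trees collected so far (rather than only the latest batch) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# One token's annotation: (head index, dependency relation label).
Arc = Tuple[int, str]
# A parser maps a tokenized sentence to one Arc per token.
Parser = Callable[[List[str]], List[Arc]]


@dataclass
class Tree:
    """A sentence with one (head, relation) pair per token."""
    tokens: List[str]
    arcs: List[Arc]


def annotate_iteratively(
    batches: Sequence[Sequence[List[str]]],
    parse: Parser,
    correct: Callable[[Tree], Tree],
    fine_tune: Callable[[Parser, List[Tree]], Parser],
) -> Tuple[List[Tree], Parser]:
    """Hypothetical human-in-the-loop annotation loop.

    For each batch of sentences:
      i)   pseudo-annotate with the current parser,
      ii)  have an annotator correct each pseudo-annotation,
      iii) fine-tune the parser on the corrected trees.
    """
    treebank: List[Tree] = []
    for batch in batches:
        # i) pseudo-annotation by the current model
        pseudo = [Tree(tokens, parse(tokens)) for tokens in batch]
        # ii) manual correction (stands in for the human annotator)
        corrected = [correct(tree) for tree in pseudo]
        treebank.extend(corrected)
        # iii) fine-tuning; using the whole treebank so far is an assumption
        parse = fine_tune(parse, treebank)
    return treebank, parse
```

Passing the parser, the correction step, and the fine-tuning routine as callables keeps the sketch independent of any particular parsing library; in practice `correct` would be an annotation tool session rather than a function call.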
