Dependency Annotation of Ottoman Turkish with Multilingual BERT

Şaziye Betül Özateş, Tarık Emre Tıraş, Efe Eren Genç, Esma Fatıma Bilgin Taşdemir

TL;DR

This work tackles the scarcity of annotated Ottoman Turkish data by proposing an iterative, human-in-the-loop annotation pipeline: a multilingual BERT-based parser generates pseudo-annotations, which are manually corrected and then used to fine-tune the model. The authors introduce OTA-BOUN, the first UD-style Ottoman Turkish treebank, with data from seven late-Ottoman texts in two scripts, and establish an annotation scheme with strong inter-annotator agreement. Across two iterative batches, they show that fine-tuning on Ottoman Turkish data improves unlabeled attachment, while relation labeling remains challenging because of Persian-influenced syntax, loanwords, and longer sentences. The approach demonstrates a practical path to building historical language resources under UD, enabling automated analysis of Ottoman Turkish documents and broader historical NLP research.

Abstract

This study introduces a pretrained large language model-based annotation methodology for the first dependency treebank in Ottoman Turkish. Our experimental results show that by iteratively i) pseudo-annotating data using a multilingual BERT-based parsing model, ii) manually correcting the pseudo-annotations, and iii) fine-tuning the parsing model with the corrected annotations, we speed up and simplify the challenging dependency annotation process. The resulting treebank, which will be part of the Universal Dependencies (UD) project, will facilitate automated analysis of Ottoman Turkish documents, unlocking the linguistic richness embedded in this historical heritage.
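
The three-step loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `ToyParser`, `annotate_iteratively`, and the `correct` callback are hypothetical names, the "parser" merely memorises corrected trees instead of fine-tuning a multilingual BERT model, and the tokens are placeholders rather than Ottoman Turkish data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Arc = Tuple[int, str]        # (head index, dependency relation); head 0 is the root
Sentence = List[str]
Tree = List[Arc]


@dataclass
class ToyParser:
    """Hypothetical stand-in for a multilingual BERT-based dependency parser."""
    memory: Dict[Tuple[str, ...], Tree] = field(default_factory=dict)

    def parse(self, sentence: Sentence) -> Tree:
        # Known sentences get their memorised tree; unseen ones get a
        # trivial guess that attaches every token to the root.
        return self.memory.get(tuple(sentence), [(0, "dep") for _ in sentence])

    def fine_tune(self, treebank: List[Tuple[Sentence, Tree]]) -> None:
        # Assumption: "fine-tuning" is modelled as memorising corrected trees.
        for sentence, tree in treebank:
            self.memory[tuple(sentence)] = tree


def annotate_iteratively(
    parser: ToyParser,
    batches: List[List[Sentence]],
    correct: Callable[[Sentence, Tree], Tree],
) -> List[Tuple[Sentence, Tree]]:
    treebank: List[Tuple[Sentence, Tree]] = []
    for batch in batches:
        pseudo = [(s, parser.parse(s)) for s in batch]      # i) pseudo-annotate
        gold = [(s, correct(s, t)) for s, t in pseudo]      # ii) manual correction
        treebank.extend(gold)
        parser.fine_tune(treebank)                          # iii) fine-tune on all gold data
    return treebank


# Placeholder "human" corrections for two batches of one sentence each.
gold_lookup = {
    ("w1", "w2"): [(2, "nsubj"), (0, "root")],
    ("w3",): [(0, "root")],
}
parser = ToyParser()
treebank = annotate_iteratively(
    parser,
    batches=[[["w1", "w2"]], [["w3"]]],
    correct=lambda sentence, tree: gold_lookup[tuple(sentence)],
)
```

In a real setting the `correct` step is the human annotator, and each later batch starts from pseudo-annotations produced by a model that has already seen the earlier corrections.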

Paper Structure

This paper contains 16 sections, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Two Ottoman Turkish sentences (in Latin script), their modernized versions and English translations.
  • Figure 2: Dependency tree representations of an Ottoman Turkish sentence (above) and its rephrased version in modern Turkish (below). The highlighted portions enclosed in colored circles indicate corresponding segments in the sentences. English translations of words are provided in italics within parentheses. Words that appear in only one of the two sentences are underlined in the figure. English translation of the sentence: "The late Damat İbrahim Paşa succeeded in developing Muşkara, his birthplace, and turning it into a town."
  • Figure 3: CoNLL-U Representation of an example sentence from our Ottoman BOUN UD Treebank.
  • Figure 4: The experimental setup using the iterative annotation scheme.
  • Figure 5: Confusion matrices of gold and predicted dependency types on the first batch and the second batch. The x-axis in each plot shows the dependency types in the pseudo-annotations of the corresponding batch. The y-axis shows the dependency types in the gold annotations.
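
To make the CoNLL-U representation (Figure 3) and the gold-versus-predicted comparison (Figure 5) concrete, the sketch below reads head and relation columns from CoNLL-U text and computes unlabeled and labeled attachment scores along with confusion counts over dependency types. It is an illustrative assumption about how such a comparison might be done, not the paper's evaluation code. The helper names `row`, `read_arcs`, and `score` are hypothetical, and the tokens and relations are placeholders, not treebank data.

```python
from collections import Counter
from typing import List, Tuple

Arc = Tuple[int, str]  # (HEAD, DEPREL)


def row(idx: int, form: str, head: int, deprel: str) -> str:
    # Build one 10-column CoNLL-U token line; unused columns are "_".
    return "\t".join([str(idx), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])


def read_arcs(conllu: str) -> List[Arc]:
    """Extract (HEAD, DEPREL) pairs from one CoNLL-U sentence block."""
    arcs = []
    for line in conllu.strip().splitlines():
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comment lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue                      # skip multiword ranges (1-2) and empty nodes (1.1)
        arcs.append((int(cols[6]), cols[7]))
    return arcs


def score(gold: List[Arc], pred: List[Arc]):
    """Return (UAS, LAS, confusion counts keyed by (gold relation, predicted relation))."""
    assert len(gold) == len(pred), "gold and predicted tokenisation must match"
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n   # head only
    las = sum(g == p for g, p in zip(gold, pred)) / n         # head and relation
    confusion = Counter((g[1], p[1]) for g, p in zip(gold, pred))
    return uas, las, confusion


gold_text = "\n".join([row(1, "w1", 2, "nsubj"), row(2, "w2", 0, "root"), row(3, "w3", 2, "obj")])
pred_text = "\n".join([row(1, "w1", 2, "nsubj"), row(2, "w2", 0, "root"), row(3, "w3", 2, "obl")])

uas, las, confusion = score(read_arcs(gold_text), read_arcs(pred_text))
```

In this placeholder example all heads match but one relation label differs, so the unlabeled score is higher than the labeled one, which is the kind of gap the TL;DR describes.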