TartuNLP @ SIGTYP 2024 Shared Task: Adapting XLM-RoBERTa for Ancient and Historical Languages

Aleksei Dorkin, Kairit Sirts

TL;DR

The paper addresses the SIGTYP 2024 shared task on word embedding evaluation for ancient and historical languages by adapting XLM-RoBERTa with stacked, parameter-efficient adapters. It trains one language adapter per language and two task adapters (POS/morphology and lemmatization), with custom tokenizers and embeddings when necessary, to cover 16 languages from periods up to 1700 CE. The approach places second overall out of three submissions in the unconstrained subtask and first in word-level gap-filling, showing that models pre-trained on modern languages can be repurposed for historical languages through adapters. The work emphasizes computational efficiency and portability, and suggests adapter fusion and improved embedding strategies as future work to further close gaps for underrepresented scripts.

Abstract

We present our submission to the unconstrained subtask of the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages for morphological annotation, POS-tagging, lemmatization, character- and word-level gap-filling. We developed a simple, uniform, and computationally lightweight approach based on the adapters framework using parameter-efficient fine-tuning. We applied the same adapter-based approach uniformly to all tasks and 16 languages by fine-tuning stacked language- and task-specific adapters. Our submission obtained an overall second place out of three submissions, with the first place in word-level gap-filling. Our results show the feasibility of adapting language models pre-trained on modern languages to historical and ancient languages via adapter training.

Paper Structure (14 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: An illustration of the Bottleneck Adapter from Houlsby et al. (2019). The left side demonstrates how a bottleneck adapter is added to a single transformer layer, while the structure of an individual adapter layer is on the right. Only elements in green are trained, while the rest remains frozen.
  • Figure 2: An illustration of an adapter stack as presented on the AdapterHub documentation page. Blue and green blocks represent different adapter layers stacked on top of each other.
  • Figure 3: A schematic illustration of the decoding process for word-level mask filling. Blue boxes represent the current tokens in the sentence, while orange boxes represent the probability distribution over tokens at the position of the mask token. The upper part shows the first decoding step: we predict the most likely replacement for the leftmost masked token among tokens that start with the _ symbol, which marks the beginning of a new word. Then, as shown in the middle part, we replace the mask with that token, append a new mask token to its right, and predict the most likely replacement for the new mask. If the prediction starts with the _ symbol, we discard the mask token, consider the word complete, and move on to the next masked word, if there is one. Otherwise, we append the predicted token to the previously predicted one. We repeat this process $k$ times or until we encounter a token starting with _. $k$ is a tunable hyperparameter, but increasing it significantly increases decoding time, so we set $k$ to 1 for all languages. The bottom of the figure shows the final result with no mask tokens. Note that this description is specific to the XLM-RoBERTa tokenizer; other models' tokenizers may behave differently.
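
The decoding procedure described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict` is a hypothetical stand-in for a masked language model that returns token probabilities at one position, `toy_predict` is a hand-written lookup table used only so the example runs, and the plain `_` word-start marker mirrors the caption (SentencePiece-based tokenizers typically use the `▁` character instead).

```python
from typing import Callable, Dict, List

MASK = "<mask>"
WORD_START = "_"  # assumption: mirrors the caption; real tokenizers may use "▁"

# Hypothetical model interface: (tokens, position) -> {candidate token: probability}
Predictor = Callable[[List[str], int], Dict[str, float]]


def fill_masked_word(tokens: List[str], position: int,
                     predict: Predictor, k: int = 1) -> List[str]:
    """Replace the mask at `position` with one word of up to k + 1 subword tokens."""
    # Step 1: pick the most likely candidate that begins a new word.
    # Assumes at least one word-initial candidate exists.
    scores = predict(tokens, position)
    starters = {t: p for t, p in scores.items() if t.startswith(WORD_START)}
    pieces = [max(starters, key=starters.get)]

    # Step 2: up to k times, append a fresh mask and predict a continuation.
    for _ in range(k):
        trial = tokens[:position] + pieces + [MASK] + tokens[position + 1:]
        scores = predict(trial, position + len(pieces))
        best = max(scores, key=scores.get)
        if best.startswith(WORD_START):
            break  # a new word begins here: discard the extra mask
        pieces.append(best)  # continuation piece of the current word

    return tokens[:position] + pieces + tokens[position + 1:]


def fill_all(tokens: List[str], predict: Predictor, k: int = 1) -> List[str]:
    """Fill masked words one at a time, always starting with the leftmost mask."""
    tokens = list(tokens)
    while MASK in tokens:
        tokens = fill_masked_word(tokens, tokens.index(MASK), predict, k)
    return tokens


def toy_predict(tokens: List[str], position: int) -> Dict[str, float]:
    """Toy predictor that only looks at the token to the left (illustration only)."""
    left = tokens[position - 1] if position > 0 else ""
    table = {
        "_the": {"s": 0.5, "_cat": 0.4, "_dog": 0.1},
        "_cat": {"s": 0.7, "_sat": 0.3},
        "s": {"_sat": 0.9, "s": 0.1},
    }
    return table.get(left, {"_x": 1.0})


if __name__ == "__main__":
    print(fill_all(["_the", MASK, "_sat"], toy_predict, k=1))
```

In the first step the toy predictor ranks the continuation piece `s` highest, but it is skipped because the first token of a masked word must begin a new word. With $k = 1$, as in the caption, each masked word receives at most one continuation piece, which keeps the number of extra model calls per masked word to one.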