TartuNLP @ SIGTYP 2024 Shared Task: Adapting XLM-RoBERTa for Ancient and Historical Languages
Aleksei Dorkin, Kairit Sirts
TL;DR
The paper describes a submission to the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages that adapts XLM-RoBERTa with stacked, parameter-efficient adapters. For each language it trains a language adapter and two task adapters (one for POS and morphological tagging, one for lemmatization), adding custom tokenizers and embeddings when necessary, to cover 16 ancient and historical languages dating up to 1700 CE. The approach placed second overall out of three submissions in the unconstrained subtask and first in word-level gap-filling, showing that models pre-trained on modern languages can be repurposed for historical languages through adapters. The work emphasizes computational efficiency and portability, and suggests adapter fusion and improved embedding strategies as future enhancements to narrow the remaining gaps for underrepresented scripts.
Abstract
We present our submission to the unconstrained subtask of the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages, which covers morphological annotation, POS-tagging, lemmatization, and character- and word-level gap-filling. We developed a simple, uniform, and computationally lightweight approach based on the adapters framework using parameter-efficient fine-tuning. We applied the same adapter-based approach to all tasks and all 16 languages by fine-tuning stacked language- and task-specific adapters. Our submission obtained an overall second place out of three submissions, with the first place in word-level gap-filling. Our results show the feasibility of adapting language models pre-trained on modern languages to historical and ancient languages via adapter training.
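To give a rough sense of why stacked adapters count as computationally lightweight, the sketch below counts the trainable parameters of a language adapter stacked with a task adapter in every transformer layer. It assumes a standard bottleneck adapter design (a down-projection and an up-projection, each with a bias), which the abstract does not specify. The hidden size of 768 and the 12 layers match the XLM-RoBERTa base configuration, but which model size the submission used is not stated here, and the bottleneck size of 48 is purely illustrative. The function names are hypothetical.

```python
def bottleneck_adapter_params(hidden_size: int, bottleneck_size: int) -> int:
    """Count parameters of one bottleneck adapter (assumed design):
    a down-projection and an up-projection, each with a bias vector."""
    down = hidden_size * bottleneck_size + bottleneck_size
    up = bottleneck_size * hidden_size + hidden_size
    return down + up


def stacked_adapter_params(hidden_size: int, bottleneck_size: int,
                           num_layers: int, adapters_per_layer: int = 2) -> int:
    """Count parameters when several adapters (e.g. one language adapter
    and one task adapter) are stacked in every transformer layer."""
    per_adapter = bottleneck_adapter_params(hidden_size, bottleneck_size)
    return num_layers * adapters_per_layer * per_adapter


# Assumed sizes: XLM-RoBERTa base dimensions, illustrative bottleneck.
HIDDEN = 768
LAYERS = 12
BOTTLENECK = 48

per_adapter = bottleneck_adapter_params(HIDDEN, BOTTLENECK)
total = stacked_adapter_params(HIDDEN, BOTTLENECK, LAYERS)

print(f"one adapter in one layer: {per_adapter:,} parameters")
print(f"language + task adapter, all layers: {total:,} parameters")
```

Under these assumptions, a stacked language and task adapter pair adds under two million trainable parameters, while the pre-trained model weights stay frozen.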
