Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac

Chahan Vidal-Gorène, Bastien Kindt, Florian Cafiero

TL;DR

This paper investigates the capacity of recent large language models, including GPT-4 variants and open-weight Mistral models, to perform lemmatization and POS-tagging in few-shot and zero-shot settings for four historically and linguistically diverse under-resourced languages.

Abstract

Low-resource languages pose persistent challenges for Natural Language Processing tasks such as lemmatization and part-of-speech (POS) tagging. This paper investigates the capacity of recent large language models (LLMs), including GPT-4 variants and open-weight Mistral models, to address these tasks in few-shot and zero-shot settings for four historically and linguistically diverse under-resourced languages: Ancient Greek, Classical Armenian, Old Georgian, and Syriac. Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models across lemmatization and POS-tagging, and compare them with PIE, a task-specific RNN baseline. Our results demonstrate that LLMs, even without fine-tuning, achieve competitive or superior performance in POS-tagging and lemmatization across most languages in few-shot settings. Significant challenges persist for languages characterized by complex morphology and non-Latin scripts, but we demonstrate that LLMs are a credible and relevant option for initiating linguistic annotation tasks in the absence of data, serving as an effective aid for annotation.

Paper Structure

This paper contains 20 sections, 1 equation, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Annotation guidelines and tagset for Greek, Armenian, Georgian and Syriac, using @ to split agglutinated and polylexical forms
  • Figure 2: Schematic representation of the prompt structure used for all languages.
  • Figure 3: Out-of-domain accuracy for lemmatization and POS tagging across our four historical languages, comparing the supervised PIE baseline with the best open-weight and closed LLM annotators. Values report maximum accuracy by task and language.
  • Figure 4: Accuracy as a function of the number of in-context examples (0, 5, 50, 500 shots) for lemmatization and POS tagging, reported separately for each language (GRC, HYE, KAT, SYC). Solid lines show the model accuracy at each shot count; shaded bands span the two evaluation conditions and thus indicate the range between in-domain and out-of-domain accuracies at the same shot count. Dashed horizontal lines mark the supervised PIE reference accuracies obtained with 1,000 and 5,000 training examples.