LLMs in the Loop: Leveraging Large Language Model Annotations for Active Learning in Low-Resource Languages

Nataliia Kholodna; Sahib Julka; Mohammad Khodadadi; Muhammed Nurullah Gumus; Michael Granitzer

TL;DR

Low-resource languages suffer from scarce labeled data, hindering AI deployment. The authors propose a framework that embeds foundation models into an active learning loop to annotate NER data for African languages (MasakhaNER 2.0), reducing annotation effort while achieving near-state-of-the-art performance. Through comprehensive evaluation of multiple LLMs, they select GPT-4-Turbo for annotation tasks, introduce representative sampling, prompt design, batching, and data-contamination checks, and demonstrate significant data and cost savings in an AL setting. The work shows substantial potential to broaden inclusion of low-resource languages and guide future automation efforts, with reported cost savings of at least 42.45x compared to human annotation and minimal data leakage.

Abstract

Low-resource languages face significant barriers in AI development due to limited linguistic resources and expertise for data labeling, which makes labeled data rare and costly. The scarcity of data and the absence of preexisting tools exacerbate these challenges, especially since these languages may not be adequately represented in various NLP datasets. To address this gap, we propose leveraging the potential of LLMs in the active learning loop for data annotation. Initially, we conduct evaluations to assess inter-annotator agreement and consistency, facilitating the selection of a suitable LLM annotator. The chosen annotator is then integrated into a training loop for a classifier using an active learning paradigm, minimizing the amount of queried data required. Empirical evaluations, notably employing GPT-4-Turbo, demonstrate near-state-of-the-art performance with significantly reduced data requirements and estimated potential cost savings of at least 42.45 times compared to human annotation. Our proposed solution shows promising potential to substantially reduce both the monetary and computational costs associated with automation in low-resource settings. By bridging the gap between low-resource languages and AI, this approach fosters broader inclusion and shows the potential to enable automation across diverse linguistic landscapes.
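
The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`active_learning_loop`, `annotate`, `train`, `uncertainty`), the deterministic cold start, and the toy one-dimensional threshold classifier are all assumptions made for the example. In the paper's setting, `annotate` would wrap a call to the chosen LLM and `train` would fine-tune the NER classifier.

```python
from typing import Callable, List, Sequence, Tuple


def active_learning_loop(
    pool: Sequence[float],
    annotate: Callable[[float], int],
    train: Callable[[List[Tuple[float, int]]], float],
    uncertainty: Callable[[float, float], float],
    batch_size: int,
    rounds: int,
) -> Tuple[float, List[Tuple[float, int]]]:
    """Generic pool-based loop: select, annotate, retrain, repeat."""
    unlabeled = list(pool)
    labeled: List[Tuple[float, int]] = []
    model = None
    for _ in range(rounds):
        if not unlabeled:
            break
        if model is None:
            # Cold start (assumption): take the first items of the pool.
            batch = unlabeled[:batch_size]
        else:
            # Query the samples the current model is least sure about.
            ranked = sorted(unlabeled, key=lambda x: uncertainty(model, x), reverse=True)
            batch = ranked[:batch_size]
        for x in batch:
            labeled.append((x, annotate(x)))  # stand-in for an LLM annotation call
            unlabeled.remove(x)
        model = train(labeled)
    return model, labeled


def stub_annotator(x: float) -> int:
    # Hypothetical annotator: a fixed rule replaces the LLM in this toy example.
    return int(x >= 0.5)


def train_threshold(labeled: List[Tuple[float, int]]) -> float:
    # Toy classifier: a decision threshold between the two classes.
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    if not zeros or not ones:
        # Only one class seen so far: fall back to the mean of labeled points.
        return sum(x for x, _ in labeled) / len(labeled)
    return (max(zeros) + min(ones)) / 2


def margin_uncertainty(threshold: float, x: float) -> float:
    # Points closer to the threshold are more uncertain (higher score).
    return -abs(x - threshold)


pool = [i / 20 for i in range(20)]
model, labeled = active_learning_loop(
    pool, stub_annotator, train_threshold, margin_uncertainty, batch_size=4, rounds=3
)
print(f"threshold={model:.3f} after labeling {len(labeled)} of {len(pool)} samples")
```

In this toy run the loop labels only part of the pool before the threshold settles near the true class boundary, which is the kind of data saving the active learning paradigm aims for.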

Paper Structure (20 sections, 8 figures, 6 tables, 1 algorithm)

Introduction
LLMs in the Loop
Experiments
Foundation Model Selection
Representative Data Subset Sampling for LLM Evaluation
Querying LLMs
Correct Output Format
Inter Annotator Agreement
Consistency
LLM Evaluation Results
Effect of Prompt Design and Querying LLMs in batches
Data Contamination
Active Learning
Conclusion
Acknowledgments
...and 5 more sections

Figures (8)

Figure 1: Overview of our methodology. The process involves selecting the most informative samples from the training set, and querying the LLM with a pre-defined prompt template to obtain annotations. The problem-specific classifier is then trained with these queried annotations and evaluated on the unseen test set.
Figure 2: Entity distribution in a) the overall Bambara dataset and b) 50 sampled records (cf. Supplementary Material). Non-entities are excluded from the chart.
Figure 3: Condensed version of the full prompt template with non-essential details abbreviated as [...].
Figure 4: Prompt template for instructing the LLM to identify the source dataset of a given data sample. The placeholder {sentence} is replaced with the actual data sample for this task. For multilingual datasets, the term 'multilingual' is specified in the prompt, while it is omitted for monolingual datasets.
Figure 5: Accuracy (without non-entities) on the Bambara test set achieved using ground truth (left) and GPT-4-Turbo annotations (right) in our active learning framework. The x-axis denotes the percentage of the dataset used at each active learning iteration, and the red dashed line represents our baseline: plain AfroXLMR-mini training on 100% of the dataset without active learning.
...and 3 more figures
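
The metric named in the Figure 5 caption, accuracy without non-entities, could be computed along the following lines. This is a sketch under an assumption: it reads the metric as token-level accuracy restricted to tokens whose gold tag is not the non-entity tag `O`. The function name `entity_accuracy` and the example tags are hypothetical and not taken from the paper.

```python
from typing import Sequence


def entity_accuracy(gold: Sequence[str], pred: Sequence[str], non_entity: str = "O") -> float:
    """Token-level accuracy over tokens whose gold tag is an entity tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Keep only positions where the gold label is an entity (assumption).
    pairs = [(g, p) for g, p in zip(gold, pred) if g != non_entity]
    if not pairs:
        return float("nan")  # no entity tokens to score
    return sum(g == p for g, p in pairs) / len(pairs)


gold_tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred_tags = ["B-PER", "O", "O", "B-LOC", "B-ORG"]
score = entity_accuracy(gold_tags, pred_tags)
print(f"entity accuracy: {score:.3f}")
```

Excluding non-entity tokens keeps the dominant `O` class from inflating the score, which matters for NER datasets where most tokens are not entities.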
