Open-Source LLMs for Text Annotation: A Practical Guide for Model Setting and Fine-Tuning

Meysam Alizadeh; Maël Kubli; Zeynab Samei; Shirin Dehghani; Mohammadmasiha Zahedivafa; Juan Diego Bermeo; Maria Korobeynikova; Fabrizio Gilardi

TL;DR

This study evaluates open-source LLMs for political-text annotation, comparing zero-shot, few-shot, and fine-tuned settings against GPT-3.5/4 across multiple datasets and annotation tasks. It demonstrates that fine-tuning open-source models (using LoRA adapters and 4-bit quantization) often closes the gap with proprietary models and can surpass few-shot and zero-shot baselines, while few-shot results are task-dependent. The authors provide a cost-conscious, reproducible workflow and show that open-source LLMs offer advantages in transparency, data protection, and accessibility, with practical recommendations on data size, temperature, and model selection. The work culminates in actionable guidance for researchers who want to use fine-tuned open-source LLMs for robust text annotation in political science. An accompanying Python notebook and replication package further facilitate adoption and benchmarking in future studies.

Abstract

This paper studies the performance of open-source Large Language Models (LLMs) in text classification tasks typical for political science research. By examining tasks like stance, topic, and relevance classification, we aim to guide scholars in making informed decisions about their use of LLMs for text analysis. Specifically, we assess both zero-shot and fine-tuned LLMs across a range of text annotation tasks using datasets of news articles and tweets. Our analysis shows that fine-tuning improves the performance of open-source LLMs, allowing them to match or even surpass zero-shot GPT-3.5 and GPT-4, though they still lag behind fine-tuned GPT-3.5. We further establish that, with a relatively modest quantity of annotated text, fine-tuning is preferable to few-shot training. Our findings show that fine-tuned open-source LLMs can be effectively deployed in a broad spectrum of text annotation applications. We provide a Python notebook facilitating the application of LLMs in text annotation for other researchers.

Paper Structure (54 sections, 7 figures, 2 tables)

Introduction
Related Work
Materials and Methods
Data
Data Annotation Tasks
Trained Annotators
Crowd-workers
LLM Selection and Settings
Prompt Engineering
LLM Fine-Tuning
Evaluation Metrics
Results
Choosing the training approach: Zero-Shot versus Few-Shot
Temperature Setting: Higher versus Lower
Model Selection: Proprietary versus Open-Source LLMs
...and 39 more sections

Figures (7)

Figure 1: Comparing zero- and few-shot text annotation of GPT-3.5, GPT-4, and LLaMA-1 (HuggingChat). The x-axis shows accuracy. The y-axis displays the models grouped by configuration (Zero-Shot and Few-Shot). Facets represent distinct tasks and/or datasets for evaluating model configurations.
Figure 2: Analyzing the effect of LLaMA-1 (HuggingChat)'s temperature parameter on accuracy and intercoder agreement in text annotation tasks.
Figure 3: Analyzing the effect of GPT-3.5's temperature parameter on accuracy and intercoder agreement in text annotation tasks.
Figure 4: Accuracy of GPT-3.5, GPT-4, open-source LLMs, and MTurk. Accuracy means agreement with trained annotators. Bars indicate average accuracy, while whiskers range from minimum to maximum accuracy across models with different parameters and/or prompts (zero vs few shot).
Figure 5: Performance (accuracy) of GPT-3.5, LLaMA-1, LLaMA-2, and FLAN-T5 (XL), as a function of the training data size for fine-tuning. The x-axis shows different sizes of training datasets, ranging from zero-shot (no fine-tuning) to 50, 100, 250, 500, and 1,000 rows used for fine-tuning the models. The y-axis displays the accuracy of the models in percentages. Facets represent distinct tasks and/or datasets for evaluating the models. Pink dots represent zero-shot GPT-4 accuracy for the sake of comparison.
...and 2 more figures
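
The caption of Figure 4 defines accuracy as agreement with trained annotators and summarizes each model by its average, minimum, and maximum accuracy across configurations. A minimal sketch of that computation follows. The function names, label values, and example runs are hypothetical and chosen only for illustration; they are not taken from the paper's notebook or replication package.

```python
from statistics import mean


def accuracy(predicted, reference):
    """Share of items where the model label equals the trained-annotator label."""
    if len(predicted) != len(reference):
        raise ValueError("label lists must have the same length")
    if not reference:
        raise ValueError("no labels to compare")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def summarize(runs, reference):
    """Average, minimum, and maximum accuracy across several runs of one model.

    Each run is one list of labels, e.g. produced with a different prompt
    or parameter setting (an assumption made for this illustration).
    """
    scores = [accuracy(run, reference) for run in runs]
    return {"mean": mean(scores), "min": min(scores), "max": max(scores)}


# Hypothetical labels from trained annotators for four texts.
reference = ["relevant", "irrelevant", "relevant", "relevant"]

# Hypothetical model outputs under two configurations.
runs = [
    ["relevant", "irrelevant", "irrelevant", "relevant"],  # one disagreement
    ["relevant", "irrelevant", "relevant", "relevant"],    # full agreement
]

summary = summarize(runs, reference)
print(summary)
```

In a plot like Figure 4, the `mean` value would correspond to the bar height and the `min` and `max` values to the whisker ends.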
