Prompt Selection Matters: Enhancing Text Annotations for Social Sciences with Large Language Models

Louis Abraham; Charles Arnal; Antoine Marie

TL;DR

The paper investigates how prompt formulation affects LLM-based text annotation in social sciences, revealing substantial performance variability across prompts. It introduces Automatic Prompt Optimization (APO) and the Prompt Ultra browser tool to systematically generate and compare prompts, demonstrating APO's ability to achieve robust accuracy across diverse tasks. Through experiments on tasks like hate, emotion, sentiment, and political orientation classification, the work shows that no single handcrafted prompt consistently outperforms others, while APO provides reliable improvements with practical ease of use. The study highlights replicability and training-data leakage concerns in LLM-based labeling and suggests future work on explainability and confidence-aware labeling for real-world research applications.

Abstract

Large Language Models have recently been applied to text annotation tasks from social sciences, equalling or surpassing the performance of human workers at a fraction of the cost. However, no inquiry has yet been made into the impact of prompt selection on labelling accuracy. In this study, we show that performance varies greatly between prompts, and we apply the method of automatic prompt optimization to systematically craft high-quality prompts. We also provide the community with a simple, browser-based implementation of the method at https://prompt-ultra.github.io/.

Paper Structure (16 sections, 6 figures, 2 tables)

Introduction
Automatic text labelling using LLMs
Limitations of automatic text labelling
Related Works on Prompting for Text Classification
Prompt Ultra, our automatic dataset labelling app
Experiments
Datasets and tasks
Hand-crafted prompts
Automatic prompt optimization (APO)
Results and discussion
Conclusion
Data availability statement
Conflict of interest
Results of the main experiments as numerical tables
Optimized prompts
...and 1 more section

Figures (6)

Figure 1: A screenshot of one of the website's tabs.
Figure 2: Micro-averaged $F_1$ scores (in $\%$) of the hand-crafted prompts and of the best prompt obtained using automatic prompt optimization (APO) on each of the datasets and tasks described in the "Datasets and tasks" subsection. $95\%$ confidence intervals are represented. The same results are reported as a numerical table in the Appendix.
Figure 3: Micro-averaged $F_1$ scores (in $\%$) of the hand-crafted prompts on the train set and on the test set of the TE-hate and TE-emotion datasets. $95\%$ confidence intervals are represented. The same results are reported as a numerical table in the Appendix.
Figure 4: The EVÅL tab.
Figure 5: The ØPTIM tab.
...and 1 more figure
