Prompt Selection Matters: Enhancing Text Annotations for Social Sciences with Large Language Models

Louis Abraham, Charles Arnal, Antoine Marie

TL;DR

The paper investigates how prompt formulation affects LLM-based text annotation in social sciences, revealing substantial performance variability across prompts. It introduces Automatic Prompt Optimization (APO) and the Prompt Ultra browser tool to systematically generate and compare prompts, demonstrating APO's ability to achieve robust accuracy across diverse tasks. Through experiments on tasks like hate, emotion, sentiment, and political orientation classification, the work shows that no single handcrafted prompt consistently outperforms others, while APO provides reliable improvements with practical ease of use. The study highlights replicability and training-data leakage concerns in LLM-based labeling and suggests future work on explainability and confidence-aware labeling for real-world research applications.

Abstract

Large Language Models have recently been applied to text annotation tasks from social sciences, equalling or surpassing the performance of human workers at a fraction of the cost. However, no inquiry has yet been made on the impact of prompt selection on labelling accuracy. In this study, we show that performance greatly varies between prompts, and we apply the method of automatic prompt optimization to systematically craft high quality prompts. We also provide the community with a simple, browser-based implementation of the method at https://prompt-ultra.github.io/ .
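
As a rough illustration of the comparison the abstract describes, the sketch below scores several candidate prompts on a small labelled set using micro-averaged $F_1$, the metric named in the figure captions. It is not the authors' implementation: `score_prompts`, `toy_annotate`, and the example data are hypothetical, and `toy_annotate` is a keyword rule standing in for a real LLM call.

```python
from typing import Callable, Dict, Sequence


def micro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Micro-averaged F1 for single-label classification.

    True positives, false positives and false negatives are pooled over
    all classes. Each wrong prediction counts once as a false positive
    (for the predicted class) and once as a false negative (for the gold
    class), so in this setting the score coincides with accuracy.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g == p for g, p in zip(gold, pred))
    fp = fn = len(gold) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def score_prompts(
    prompts: Dict[str, str],
    texts: Sequence[str],
    gold: Sequence[str],
    annotate: Callable[[str, str], str],
) -> Dict[str, float]:
    """Label every text with every prompt and return one score per prompt."""
    return {
        name: micro_f1(gold, [annotate(prompt, text) for text in texts])
        for name, prompt in prompts.items()
    }


def toy_annotate(prompt: str, text: str) -> str:
    # Hypothetical stand-in for an LLM call: the "prompt" is a single
    # keyword, and a text is labelled positive if it contains that keyword.
    return "positive" if prompt in text.lower() else "negative"


if __name__ == "__main__":
    texts = ["great film", "good acting", "dull plot", "good pacing"]
    gold = ["positive", "positive", "negative", "positive"]
    prompts = {"p1": "great", "p2": "good"}

    scores = score_prompts(prompts, texts, gold, toy_annotate)
    best = max(scores, key=scores.get)
    print(scores, "best:", best)
```

Even in this toy setting the two "prompts" produce different scores on the same data, which is the kind of variability the paper measures. In practice the selected prompt should be re-scored on held-out data, as the train/test comparison in Figure 3 suggests.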

Paper Structure

This paper contains 16 sections, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: A screenshot of one of the website's tabs.
  • Figure 2: Micro-averaged $F_1$ scores (in $\%$) of the hand-crafted prompts and of the best prompt obtained using automatic prompt optimization (APO) on each of the datasets and tasks described in the paper's datasets subsection. $95\%$ confidence intervals are represented. The same results are reported as a numerical table in the Appendix.
  • Figure 3: Micro-averaged $F_1$ scores (in $\%$) of the hand-crafted prompts on the train set and on the test set of the TE-hate and TE-emotion datasets. $95\%$ confidence intervals are represented. The same results are reported as a numerical table in the Appendix.
  • Figure 4: The EVÅL tab.
  • Figure 5: The Ø PTIM tab.
  • ...and 1 more figure