Prompt Selection Matters: Enhancing Text Annotations for Social Sciences with Large Language Models
Louis Abraham, Charles Arnal, Antoine Marie
TL;DR
The paper investigates how prompt formulation affects LLM-based text annotation in the social sciences and finds substantial performance variability across prompts. It applies Automatic Prompt Optimization (APO) and introduces the Prompt Ultra browser tool to systematically generate and compare prompts, showing that APO achieves robust accuracy across diverse tasks. In experiments on hate, emotion, sentiment, and political orientation classification, no single handcrafted prompt consistently outperforms the others, whereas APO yields reliable improvements and is easy to use in practice. The study also raises replicability and training-data leakage concerns in LLM-based labeling and suggests future work on explainability and confidence-aware labeling for real-world research applications.
Abstract
Large Language Models have recently been applied to text annotation tasks from the social sciences, equalling or surpassing the performance of human workers at a fraction of the cost. However, no inquiry has yet been made into the impact of prompt selection on labelling accuracy. In this study, we show that performance varies greatly between prompts, and we apply the method of automatic prompt optimization to systematically craft high-quality prompts. We also provide the community with a simple, browser-based implementation of the method at https://prompt-ultra.github.io/.
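To make the idea of comparing prompts concrete, here is a minimal sketch of scoring several candidate prompts on a small labelled set and keeping the most accurate one. It is not the paper's method or the Prompt Ultra implementation: automatic prompt optimization also generates new candidate prompts, whereas this sketch only selects among fixed ones. The names `accuracy`, `select_best_prompt`, and `toy_label_fn`, as well as the prompts and examples, are hypothetical; `toy_label_fn` is a keyword-based stand-in for a real LLM call.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, str]            # (text, gold label)
LabelFn = Callable[[str, str], str]  # (prompt, text) -> predicted label


def accuracy(prompt: str, examples: Sequence[Example], label_fn: LabelFn) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    correct = sum(label_fn(prompt, text) == gold for text, gold in examples)
    return correct / len(examples)


def select_best_prompt(
    prompts: Sequence[str], examples: Sequence[Example], label_fn: LabelFn
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate prompt and return the best one with all scores."""
    scores = {p: accuracy(p, examples, label_fn) for p in prompts}
    best = max(scores, key=scores.get)
    return best, scores


def toy_label_fn(prompt: str, text: str) -> str:
    """Hypothetical stand-in for an LLM: behaviour depends on the prompt wording."""
    keywords = ["great", "love"] if "carefully" in prompt else ["great"]
    return "positive" if any(k in text.lower() for k in keywords) else "negative"


# Made-up data for illustration only.
DEV_SET = [
    ("Great service", "positive"),
    ("I love it", "positive"),
    ("Terrible", "negative"),
    ("Not for me", "negative"),
]
PROMPTS = [
    "Label the sentiment.",
    "Read carefully, then label the sentiment.",
]

best, scores = select_best_prompt(PROMPTS, DEV_SET, toy_label_fn)
for prompt, score in scores.items():
    print(f"{score:.2f}  {prompt}")
print("best:", best)
```

In practice the labelled set used for selection should be kept separate from the data on which the chosen prompt's accuracy is finally reported, so that the reported figure is not inflated by the selection step.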
