Towards Human-Level Text Coding with LLMs: The Case of Fatherhood Roles in Public Policy Documents
Lorenzo Lupo, Oscar Magnusson, Dirk Hovy, Elin Naurin, Lena Wängnerud
TL;DR
This paper investigates automated text coding in political science using large language models, prompting them with the full human codebook and labeled examples. It evaluates GPT-3.5, GPT-4, and open-source LLMs on a Swedish case study of fatherhood roles in policy documents, showing that multi-task prompting with exhaustive label descriptions can reach or exceed human coder performance while significantly reducing time and cost. The results indicate that GPT-4 excels on complex tasks, that open-source models can be viable on simpler tasks, and that coding all three tasks jointly is considerably cheaper than running them separately. The authors provide open-source tooling and a detailed appendix to guide replication and practical deployment in large-scale political text annotation.
Abstract
Recent advances in large language models (LLMs) like GPT-3.5 and GPT-4 promise automation with better results and less programming, opening up new opportunities for text analysis in political science. In this study, we evaluate LLMs on three original coding tasks involving typical complexities encountered in political science settings: a non-English language, legal and political jargon, and complex labels based on abstract constructs. Throughout the paper, we propose a practical workflow to optimize the choice of the model and the prompt. We find that the best prompting strategy consists of providing the LLMs with a detailed codebook, like the one provided to human coders. In this setting, an LLM can be as good as or possibly better than a human annotator while being much faster, considerably cheaper, and much easier to scale to large amounts of text. We also provide a comparison of GPT and popular open-source LLMs, discussing the trade-offs involved in choosing a model. Our software allows LLMs to be easily used as annotators and is publicly available: https://github.com/lorelupo/pappa.
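
To make the codebook-based prompting strategy concrete, the following is a minimal sketch of how a prompt could be assembled from a codebook and a few labeled examples, and how a model's reply could be mapped back to a label. It is an illustration under stated assumptions, not the implementation in the linked repository: the function names `build_prompt` and `parse_label`, the prompt wording, and the placeholder labels are all hypothetical, and no model API is called.

```python
from typing import Dict, List, Optional, Tuple


def build_prompt(
    codebook: Dict[str, str],
    examples: List[Tuple[str, str]],
    text: str,
) -> str:
    """Assemble a classification prompt from label descriptions and examples.

    `codebook` maps each label to its description, as a human coder would
    receive it. The wording of the instructions is an assumption.
    """
    lines = [
        "Classify the text using exactly one of the labels below.",
        "",
        "Codebook:",
    ]
    for label, description in codebook.items():
        lines.append(f"- {label}: {description}")
    if examples:
        lines.append("")
        lines.append("Examples:")
        for example_text, example_label in examples:
            lines.append(f"Text: {example_text}")
            lines.append(f"Label: {example_label}")
    lines.append("")
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)


def parse_label(completion: str, codebook: Dict[str, str]) -> Optional[str]:
    """Map a raw model reply to a codebook label, or None if it matches none.

    Only the first line of the reply is considered, and matching is exact
    (case-insensitive) to avoid confusing labels that contain one another.
    """
    stripped = completion.strip()
    if not stripped:
        return None
    candidate = stripped.splitlines()[0].strip().strip(".:\"'").lower()
    for label in codebook:
        if candidate == label.lower():
            return label
    return None


if __name__ == "__main__":
    # Hypothetical placeholder labels, invented for illustration only.
    codebook = {
        "label_a": "Description of the first category.",
        "label_b": "Description of the second category.",
    }
    examples = [("An example sentence.", "label_a")]
    print(build_prompt(codebook, examples, "A new sentence to code."))
    print(parse_label("Label_B.", codebook))
```

Replies that do not match any label are returned as `None` rather than guessed, so they can be counted or sent to a human coder for review.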
