Evaluation is all you need. Prompting Generative Large Language Models for Annotation Tasks in the Social Sciences. A Primer using Open Models

Maximilian Weber, Merle Reichardt

TL;DR

This paper advocates using open-source LLMs for social science annotation to mitigate the reproducibility and privacy concerns associated with proprietary models. It evaluates multiple prompting strategies (zero-/one-/few-shot, chain-of-thought, self-consistency, prompt patterns) across two tasks (tweet sentiment analysis and leisure-activity detection in childhood essays) using five open 7B models. Results indicate generally moderate agreement with gold data, with performance varying by task and prompting approach; few-shot and CoT strategies yield benefits in different contexts. The study underscores the importance of task-specific prompt engineering and model selection, and it highlights the practical advantages of open models for data privacy and reproducibility while acknowledging environmental and linguistic-bias considerations. Replication resources are provided to enable adoption and further development of open LLMs for social science annotation.

Abstract

This paper explores the use of open generative Large Language Models (LLMs) for annotation tasks in the social sciences. The study highlights the challenges associated with proprietary models, such as limited reproducibility and privacy concerns, and advocates the adoption of open (source) models that can be operated on independent devices. Two example annotation tasks are provided: sentiment analysis of tweets and identification of leisure activities in childhood aspirational essays. The study evaluates the performance of different prompting strategies and models (neural-chat-7b-v3-2, Starling-LM-7B-alpha, openchat_3.5, zephyr-7b-alpha and zephyr-7b-beta). The results indicate the need for careful validation and tailored prompt engineering. The study also highlights the advantages of open models for data privacy and reproducibility.
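
To make the evaluation setup concrete, here is a minimal sketch of two pieces mentioned in this summary: combining several sampled answers by majority vote (the core idea of self-consistency) and scoring predicted labels against gold labels. The function names, the placeholder prompt identifiers, and the choice of accuracy and Cohen's kappa as agreement measures are assumptions for illustration; the summary does not state which metrics or prompt names the authors use. Only the five model names come from the abstract.

```python
from collections import Counter
from itertools import product

# Model names as listed in the abstract.
MODELS = [
    "neural-chat-7b-v3-2",
    "Starling-LM-7B-alpha",
    "openchat_3.5",
    "zephyr-7b-alpha",
    "zephyr-7b-beta",
]
# Hypothetical prompt identifiers; the real prompt names are not given here.
PROMPTS = [f"prompt_{i:02d}" for i in range(1, 16)]


def majority_vote(samples):
    """Return the most frequent label among several sampled answers.

    Ties are resolved in favour of the label that appears first.
    """
    return Counter(samples).most_common(1)[0][0]


def accuracy(gold, pred):
    """Share of items where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohen_kappa(gold, pred):
    """Chance-corrected agreement between two label sequences."""
    n = len(gold)
    observed = accuracy(gold, pred)
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    labels = set(gold) | set(pred)
    # Agreement expected if both label sequences were independent.
    expected = sum(gold_counts[l] * pred_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1 - expected)


# Every model is paired with every prompt: 5 * 15 combinations.
grid = list(product(MODELS, PROMPTS))

if __name__ == "__main__":
    gold = ["pos", "neg", "pos", "neg"]
    pred = ["pos", "neg", "neg", "neg"]
    print(len(grid), accuracy(gold, pred), cohen_kappa(gold, pred))
```

In practice each (model, prompt) pair in `grid` would produce one list of predictions, scored against the same gold labels, so that prompts and models can be compared on equal terms.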

Paper Structure (14 sections, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Decision tree for the use of generative LLMs for text annotation.
  • Figure 2: Evaluation results for the sentiment annotation of tweets. The figure displays 75 prediction approaches (5 models with 15 prompts each).
  • Figure 3: Evaluation results for the annotation indicating whether leisure activities are mentioned in childhood essays. The figure displays 75 prediction approaches (5 models with 15 prompts each).