
Keeping Humans in the Loop: Human-Centered Automated Annotation with Generative AI

Nicholas Pangakis, Samuel Wolken

TL;DR

This study evaluates automated annotation with GPT-4 in a human-in-the-loop framework across 27 tasks drawn from 11 password-protected computational social science (CSS) datasets. By comparing LLM outputs against human annotations and against BERT baselines, the authors find substantial variation in performance across tasks and higher recall than precision, suggesting that LLMs are best used as a high-recall first stage in a multi-stage workflow. Prompt optimization and temperature tuning yield only modest gains, underscoring the continued need for human validation and task-by-task evaluation. The findings support a human-centered approach to responsible AI in automated annotation, with human-labeled ground-truth data guiding deployment and interpretation in CSS research.

Abstract

Automated text annotation is a compelling use case for generative large language models (LLMs) in social media research. Recent work suggests that LLMs can achieve strong performance on annotation tasks; however, these studies evaluate LLMs on a small number of tasks and likely suffer from contamination due to a reliance on public benchmark datasets. Here, we test a human-centered framework for responsibly evaluating artificial intelligence tools used in automated annotation. We use GPT-4 to replicate 27 annotation tasks across 11 password-protected datasets from recently published computational social science articles in high-impact journals. For each task, we compare GPT-4 annotations against human-annotated ground-truth labels and against annotations from separate supervised classification models fine-tuned on human-generated labels. Although the quality of LLM labels is generally high, we find significant variation in LLM performance across tasks, even within datasets. Our findings underscore the importance of a human-centered workflow and careful evaluation standards: Automated annotations significantly diverge from human judgment in numerous scenarios, despite various optimization strategies such as prompt tuning. Grounding automated annotation in validation labels generated by humans is essential for responsible evaluation.
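
The comparison of LLM annotations against human ground-truth labels can be illustrated with a minimal sketch. It assumes a single binary task with labels coded 0/1; the function name `precision_recall_f1` and the toy labels are hypothetical and are not taken from the paper's data or code.

```python
from typing import Sequence


def precision_recall_f1(
    human: Sequence[int], llm: Sequence[int]
) -> tuple[float, float, float]:
    """Score binary LLM labels against human ground-truth labels (1 = positive)."""
    if len(human) != len(llm):
        raise ValueError("human and llm label lists must be the same length")
    # Count agreement and disagreement, treating human labels as ground truth.
    tp = sum(1 for h, m in zip(human, llm) if h == 1 and m == 1)
    fp = sum(1 for h, m in zip(human, llm) if h == 0 and m == 1)
    fn = sum(1 for h, m in zip(human, llm) if h == 1 and m == 0)
    # Guard against division by zero when a class never occurs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical toy labels for one task (not data from the paper).
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
llm_labels = [1, 1, 1, 1, 0, 1, 1, 0]

p, r, f1 = precision_recall_f1(human_labels, llm_labels)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.67 recall=1.00 f1=0.80
```

In this toy case the LLM finds every human-labeled positive but also flags two negatives, so recall exceeds precision. Computing these scores separately for each task, rather than pooling tasks, is what exposes task-level variation.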

Paper Structure

This paper contains 20 sections, 1 equation, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: Human-in-the-loop workflow for augmenting text annotation with generative LLMs
  • Figure 2: Precision and recall for 27 annotation tasks compared to human labels. Points sharing the same color correspond to tasks conducted on the same text dataset.
  • Figure 3: Change in LLM annotation performance on training data after one round of prompt updates.
  • Figure 4: Relationship between consistency score and accuracy, TPR, and TNR. Lines are linear trend lines weighted by the number of samples that fall under each density score.
  • Figure 5: Relationship between temperature and F1 score.
  • ...and 4 more figures