Automated Annotation with Generative AI Requires Validation

Nicholas Pangakis, Samuel Wolken, Neil Fasching

TL;DR

The paper addresses the reliability of using generative AI for text annotation by proposing a task-by-task validation workflow that pits LLM-generated labels against human annotations. It validates the workflow using GPT-4 across 27 annotation tasks from 11 non-public datasets, finding promising but highly variable performance (median accuracy ~0.85, median F1 ~0.71) and a strong link between consistency and correctness. Key contributions include a practical five-step workflow, a consistency score, and open-source software to implement the approach, all aimed at ensuring reliable, cost-effective LLM-assisted annotation. The work highlights the need for careful validation and offers concrete use cases and strategies (including codebook refinement) to harness LLMs while mitigating risks of suboptimal labeling in social science text analysis.

Abstract

Generative large language models (LLMs) can be a powerful tool for augmenting text annotation procedures, but their performance varies across annotation tasks due to prompt quality, text data idiosyncrasies, and conceptual difficulty. Because these challenges will persist even as LLM technology improves, we argue that any automated annotation process using an LLM must validate the LLM's performance against labels generated by humans. To this end, we outline a workflow to harness the annotation potential of LLMs in a principled, efficient way. Using GPT-4, we validate this approach by replicating 27 annotation tasks across 11 datasets from recent social science articles in high-impact journals. We find that LLM performance for text annotation is promising but highly contingent on both the dataset and the type of annotation task, which reinforces the necessity to validate on a task-by-task basis. We make available easy-to-use software designed to implement our workflow and streamline the deployment of LLMs for automated annotation.
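
The validation step described above, comparing LLM labels with human labels for one task, can be sketched in a few lines. This is a minimal illustration, not the authors' released software: the function names `binary_metrics` and `consistency_score` are hypothetical, the labels are made-up toy values, and the consistency definition used here (the share of repeated LLM labels for one text that match the most common label) is an assumption about how such a score could be computed.

```python
from collections import Counter


def binary_metrics(human, llm):
    """Compare LLM labels with human labels for one binary (0/1) task."""
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(human, llm))
    tp = sum(1 for h, m in pairs if h == 1 and m == 1)  # true positives
    tn = sum(1 for h, m in pairs if h == 0 and m == 0)  # true negatives
    fp = sum(1 for h, m in pairs if h == 0 and m == 1)  # false positives
    fn = sum(1 for h, m in pairs if h == 1 and m == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def consistency_score(repeated_labels):
    """Share of repeated LLM labels for one text that equal the modal label.

    Assumed definition for illustration only.
    """
    if not repeated_labels:
        raise ValueError("need at least one label")
    modal_count = Counter(repeated_labels).most_common(1)[0][1]
    return modal_count / len(repeated_labels)


if __name__ == "__main__":
    # Toy labels, invented for this example.
    human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
    llm_labels = [1, 0, 0, 1, 0, 1, 1, 0]
    print(binary_metrics(human_labels, llm_labels))
    # Three repeated LLM annotations of the same text.
    print(consistency_score([1, 1, 0]))
```

Running the metrics separately for each annotation task, rather than pooling tasks, matches the paper's point that performance must be validated on a task-by-task basis.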

Paper Structure

This paper contains 7 sections, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Workflow for augmenting text annotation with an LLM
  • Figure 2: Precision and recall for each of the 27 replicated classification tasks. Color indicates the dataset: points sharing the same color correspond to tasks conducted on the same text data.
  • Figure 3: Relationship between consistency score and accuracy, TPR, and TNR.
  • Figure 4: Change in LLM annotation performance on training data after one round of codebook updates.