Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels

Nicholas Pangakis; Samuel Wolken

TL;DR

This work examines whether surrogate training labels generated by large language models (LLMs) can replace human annotations for supervised text classification in computational social science (CSS). Using a four-step workflow and 14 classification tasks drawn from data sets stored in password-protected archives, the authors show that models fine-tuned on GPT-4-generated labels perform close to models fine-tuned on human labels, with a median F1 gap of $0.039$. GPT-4 few-shot classification yields results comparable to those of models fine-tuned on GPT-generated labels. Recall remains high for the LLM-based approaches, though precision may lag behind that of models trained on human labels. The study points to fast, cost-efficient classifier development through automated annotation, while emphasizing the need for human validation to guard against biases and task-specific limitations. Overall, LLM-generated labels can substantially reduce annotation costs in CSS while maintaining competitive predictive performance, provided that rigorous validation and ethical safeguards remain in place.
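To make the headline metric concrete, the following is a minimal sketch of how a median F1 gap across tasks could be computed. It assumes the gap is taken per task (human-label F1 minus LLM-label F1) and that the median is then taken over tasks; the summary does not spell out the exact definition. The function names and the per-task scores are hypothetical and are not the paper's results.

```python
from statistics import median


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion counts; returns 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def median_f1_gap(human_f1: list[float], llm_f1: list[float]) -> float:
    """Median over tasks of (human-label F1 minus LLM-label F1).

    Assumption: the gap is computed per task before taking the median.
    """
    if len(human_f1) != len(llm_f1):
        raise ValueError("need one F1 score per task for each model type")
    return median(h - l for h, l in zip(human_f1, llm_f1))


# Made-up per-task scores for three hypothetical tasks (not the paper's data).
human = [0.80, 0.70, 0.90]
llm = [0.75, 0.69, 0.80]
gap = median_f1_gap(human, llm)
print(round(gap, 3))
```

A positive gap means the model trained on human labels scored higher on the typical task; a gap near zero means the two label sources produced similarly accurate classifiers.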

Abstract

Computational social science (CSS) practitioners often rely on human-labeled data to fine-tune supervised text classifiers. We assess the potential for researchers to augment or replace human-generated training data with surrogate training labels from generative large language models (LLMs). We introduce a recommended workflow and test this LLM application by replicating 14 classification tasks and measuring performance. We employ a novel corpus of English-language text classification data sets from recent CSS articles in high-impact journals. Because these data sets are stored in password-protected archives, our analyses are less prone to issues of contamination. For each task, we compare supervised classifiers fine-tuned using GPT-4 labels against classifiers fine-tuned with human annotations and against labels from GPT-4 and Mistral-7B with few-shot in-context learning. Our findings indicate that supervised classification models fine-tuned on LLM-generated labels perform comparably to models fine-tuned with labels from human annotators. Fine-tuning models using LLM-generated labels can be a fast, efficient and cost-effective method of building supervised text classifiers.
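The core idea in the abstract, training a supervised classifier on LLM-generated labels and then checking it against human annotations, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `surrogate_label` is a keyword rule standing in for an LLM annotator (not a real API call), the naive Bayes classifier is swapped in for the fine-tuned models the paper evaluates, and the texts are invented. It is not the authors' workflow.

```python
import math
from collections import Counter, defaultdict


def surrogate_label(text: str) -> int:
    """Hypothetical stand-in for an LLM annotator (a keyword rule, no API call)."""
    return int(any(w in text.lower() for w in ("tax", "vote", "senate")))


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (the student model)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, y in zip(texts, labels):
            self.word_counts[y].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        scores = {}
        for y, n in self.class_counts.items():
            denom = sum(self.word_counts[y].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for w in text.lower().split():
                if w in self.vocab:  # skip words never seen in training
                    score += math.log((self.word_counts[y][w] + 1) / denom)
            scores[y] = score
        return max(scores, key=scores.get)


# Step A: label an unlabeled pool with the surrogate annotator.
pool = [
    "the senate passed a tax bill",
    "voters will vote on the tax measure",
    "the team won the football game",
    "a new recipe for apple pie",
]
pool_labels = [surrogate_label(t) for t in pool]

# Step B: train the student classifier on the surrogate labels only.
student = NaiveBayes().fit(pool, pool_labels)

# Step C: evaluate against a held-out set with (invented) human labels.
test_set = [("senate debates tax bill", 1), ("football game tonight", 0)]
preds = [student.predict(t) for t, _ in test_set]
accuracy = sum(p == y for p, (_, y) in zip(preds, test_set)) / len(test_set)
print(preds, accuracy)
```

The design point the sketch preserves is that human labels appear only in the evaluation step, so any agreement with them reflects what the student learned from the surrogate labels alone.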

Paper Structure (20 sections, 2 equations, 6 figures, 8 tables)

This paper contains 20 sections, 2 equations, 6 figures, 8 tables.

Introduction
Methodology
Results
Discussion
Limitations
Ethics Statement
Appendix: Prior automated annotation research in computational social science
Overview of automated annotation research
Costs associated with implementing automated annotation
Appendix: Data sets
Appendix: Additional methodological details
Prompt tuning
Hyperparameter tuning, evaluation, and compute details
Additional details on human annotation procedures
Appendix: Extended results
...and 5 more sections

Figures (6)

Figure 1: Supervised text classification with LLM-generated training labels.
Figure 2: Box plots of performance on test data across 14 tasks. Thick vertical line denotes median. Color represents model type, with green corresponding to models fine-tuned on 1,000 human labels, orange to 250 human labels, red to 1,000 GPT labels, and blue to a few-shot model.
Figure A3: Change in LLM annotation performance on training data after one round of prompt optimization
Figure A4: Precision-recall curves across each BERT-family model
Figure A5: Box plots of ablation performance on test data across 14 tasks. Thick vertical line denotes median.
...and 1 more figure
