The Parrot Dilemma: Human-Labeled vs. LLM-augmented Data in Classification Tasks

Anders Giovanni Møller, Jacob Aarup Dalsgaard, Arianna Pera, Luca Maria Aiello

TL;DR

This study compares human-labeled data with LLM-generated augmentation (GPT-4 and Llama-2) on ten Computational Social Science (CSS) classification tasks across varying training data sizes. Using a fixed 110M-parameter classifier, it augments a 10% crowdsourced base set with nine synthetic examples per sample and benchmarks the trained models against zero-shot LLM predictions. Results show that human labels outperform synthetic augmentation on binary and balanced tasks, while synthetic data helps primarily for rare classes in complex, unbalanced multi-class tasks; zero-shot models lag behind specialized trained models in most settings. The paper offers guidelines for CSS practitioners that emphasize systematic data-quality evaluation and standardized prompt design, while acknowledging limitations in resources, safety, and distributional shifts.

Abstract

In the realm of Computational Social Science (CSS), practitioners often navigate complex, low-resource domains and face the costly and time-intensive challenges of acquiring and annotating data. We aim to establish a set of guidelines to address such challenges, comparing the use of human-labeled data with synthetically generated data from GPT-4 and Llama-2 in ten distinct CSS classification tasks of varying complexity. Additionally, we examine the impact of training data sizes on performance. Our findings reveal that models trained on human-labeled data consistently exhibit superior or comparable performance compared to their synthetically augmented counterparts. Nevertheless, synthetic augmentation proves beneficial, particularly in improving performance on rare classes within multi-class tasks. Furthermore, we leverage GPT-4 and Llama-2 for zero-shot classification and find that, while they generally display strong performance, they often fall short when compared to specialized classifiers trained on moderately sized training sets.

Paper Structure (11 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Experimental framework. For each dataset, we start from a base set ($10\%$ crowdsourced samples) and augment it either by adding manually labeled samples or synthetic samples obtained with LLMs. Augmented training sets of different sizes are used to train classifiers. Models are tested on a holdout set and compared to zero-shot approaches.
  • Figure 2: Data augmentation experiment. Macro F1 score on the test set for the ten classification tasks, given various training data sizes and augmentation strategies. Y-axis scales are defined differently for each task to enhance clarity. Each set of training samples contains $10\%$ crowdsourced samples (base set). The dashed line represents the zero-shot performance of LLMs. Each experiment undergoes 5 runs of training with different data sampling seeds and confidence intervals around average metric values are shown. Tasks are grouped by complexity levels (cf. icon tags defined in Table \ref{tab:task-difficulties}) and sorted within each group by the relative improvement in performance between crowdsourced-based and other types of training.
  • Figure 3: Class distribution per task.
  • Figure 4: Lexical and semantic diversity between original and synthetically generated data for GPT-4 and Llama-2 models. We also include similarity between random samples of original and augmented data within each task, denoted as baseline. Synthetic data for the offensiveness task could not be generated via Llama-2.
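
The metric named in the Figure 2 caption, macro F1 averaged over several runs with a confidence interval, can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `macro_f1` and `mean_with_ci` are hypothetical, the toy labels are made up for illustration, and the normal-approximation interval (z = 1.96) is an assumption, since the caption does not say how the intervals were computed. With only 5 runs, such an interval is rough.

```python
from statistics import mean, stdev


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def mean_with_ci(values, z=1.96):
    """Mean and half-width of a normal-approximation confidence interval (assumed method)."""
    m = mean(values)
    half = z * stdev(values) / len(values) ** 0.5 if len(values) > 1 else 0.0
    return m, half


# Toy example (made-up labels): an unbalanced task where the rare class "b" is never predicted.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

Because every class carries the same weight in the average, a classifier that ignores a rare class is penalized heavily under macro F1 even when its accuracy is high. This is why gains on rare classes, the setting where the paper reports synthetic augmentation to be useful, show up clearly in this metric. Passing the macro F1 scores of the 5 seeded runs to `mean_with_ci` would give the kind of average and interval plotted per training size.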