Want To Reduce Labeling Cost? GPT-3 Can Help

Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, Michael Zeng

TL;DR

The paper tackles the high cost of data labeling in NLP by using GPT-3 as a scalable, low-cost labeler to train smaller downstream models. It introduces a cost-aware labeling framework that combines a detailed cost analysis, GPT-3 labeling, and a dual human-GPT-3 supervision scheme with active labeling. Empirical results across 9 NLP tasks show 50–96% cost savings at performance comparable to or better than human-labeled baselines, with mix-and-match strategies often yielding the best results. The findings point to practical, scalable strategies for cost-efficient data labeling, and give theoretical and empirical justification for when a model trained on GPT-3 labels can outperform GPT-3 itself in few-shot regimes.

Abstract

Data annotation is a time-consuming and labor-intensive process for many NLP tasks. Although there exist various methods to produce pseudo data labels, they are often task-specific and require a decent amount of labeled data to start with. Recently, the immense language model GPT-3 with 175 billion parameters has achieved tremendous improvement across many few-shot learning tasks. In this paper, we explore ways to leverage GPT-3 as a low-cost data labeler to train other models. We find that, to make the downstream model achieve the same performance on a variety of NLU and NLG tasks, it costs 50% to 96% less to use labels from GPT-3 than using labels from humans. Furthermore, we propose a novel framework of combining pseudo labels from GPT-3 with human labels, which leads to even better performance with limited labeling budget. These results present a cost-effective data labeling methodology that is generalizable to many practical applications.

Paper Structure

This paper contains 28 sections, 1 theorem, 4 equations, 6 figures, 1 table.

Key Result

Theorem 2

Suppose $\hat{G} \in \mathcal{G}$ is the classifier that minimizes its discrepancy with GPT-3 over the input space $\mathcal{X}$. Let $\bar{a}$ be the maximum error of GPT-3 on any class $P_i$. If $P$ satisfies $(\bar{a}, \bar{c})$-expansion, then we have [displayed bound missing from the source], where $c=\min\{1/\bar{a}, \bar{c}\}$.

Figures (6)

  • Figure 1: Two examples of constructing GPT-3 input. The input prompt of GPT-3 consists of $n$ labeled data ($n$-shot learning) and the task input for which GPT-3 generates the label. The same $n$ labeled data is used for every input.
  • Figure 2: Four data labeling strategies given a fixed budget: a) label data by humans only, b) label data by GPT-3 only, c) randomly select non-overlapping data for humans and GPT-3 to label, according to a split ratio of the budget, d) select GPT-3-labeled data with lower confidence scores for humans to re-label.
  • Figure 3: Performance vs. labeling cost of various labeling strategies on 9 NLG and NLU datasets. The x-axis is the cost in dollars, estimated from OpenAI's pricing policy and the cost of crowd-sourced annotation. Each point is the average result of 3 runs of PEGASUS (NLG) or RoBERTa$_{large}$ (NLU) using 3 sets of generated labels, with the standard deviation shown. The performance of using GPT-3 as the inference model is shown as a dashed line, which is the maximum ROUGE-L/accuracy over different shot settings. Note that the cost of GPT3-Label and GPT3-Human-Label cannot increase further once all training data (up to 5,120 instances) has been labeled.
  • Figure 4: GPT-3 labeling performance. We feed unlabeled data to GPT-3 with different shot settings and fine-tune Transformer models on the corresponding labeled data. The dotted lines are the raw GPT-3 performance with various shots. Lines in the same color use the same number of shots in GPT-3. The cost of GPT3-Label cannot increase further once all training data (up to 5,120 instances) has been labeled.
  • Figure 5: Active labeling. The first row shows that logit values from GPT-3 can be treated as confidence scores, and high-confidence labels are much more accurate than low-confidence ones. The second row compares the performance of active labeling and random labeling in the GPT3-Human strategy on three different NLU datasets.
  • ...and 1 more figure
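
The prompt construction in Figure 1 and the active-labeling strategy in Figure 2(d) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Input:`/`Label:` template, the `PseudoLabel` record, the function names, and the per-label human cost are all hypothetical, and the confidence score is assumed to have been derived from the model's logits beforehand.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class PseudoLabel:
    text: str          # task input
    label: str         # label proposed by the model (GPT-3 in the paper)
    confidence: float  # assumed score, e.g. derived from the label logits


def build_prompt(shots: Sequence[Tuple[str, str]], task_input: str) -> str:
    """Build an n-shot prompt: the same n labeled examples, then the new input.

    The "Input:/Label:" template is an illustrative assumption, not the
    paper's exact format.
    """
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in shots]
    blocks.append(f"Input: {task_input}\nLabel:")
    return "\n\n".join(blocks)


def split_for_relabeling(
    items: Sequence[PseudoLabel],
    human_budget: float,
    human_cost_per_label: float,
) -> Tuple[List[PseudoLabel], List[PseudoLabel]]:
    """Pick the lowest-confidence pseudo labels for human re-labeling.

    Returns (to_human, keep): as many low-confidence items as the human
    budget can pay for, and the remaining items that keep the model's label.
    """
    n_human = min(len(items), int(human_budget // human_cost_per_label))
    ranked = sorted(items, key=lambda p: p.confidence)  # least confident first
    return ranked[:n_human], ranked[n_human:]
```

For example, with three pseudo-labeled items, a human budget of 1.0, and a hypothetical cost of 0.4 per human label, `split_for_relabeling` sends the two least confident items to human annotators and keeps the model's label for the third. Random selection (strategy c) would differ only in replacing the confidence-based sort with a shuffle.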

Theorems & Definitions (3)

  • Definition 1: Consistency assumption
  • Theorem 2
  • Definition 3: $(a, c)$-expansion (Wei et al., 2020)