Want To Reduce Labeling Cost? GPT-3 Can Help
Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, Michael Zeng
TL;DR
The paper tackles the high cost of data labeling in NLP by using GPT-3 as a low-cost labeler whose outputs train smaller downstream models. It presents a cost analysis of human versus GPT-3 labeling and a dual-supervision scheme that combines GPT-3 and human labels through active labeling. Across 9 NLP tasks, GPT-3 labeling cuts cost by 50–96% while matching or exceeding the performance of human-labeled baselines, and mixing the two label sources often yields the best results. The paper also gives theoretical and empirical justification for when a model trained on GPT-3 labels can outperform GPT-3 itself in few-shot settings.
Abstract
Data annotation is a time-consuming and labor-intensive process for many NLP tasks. Although there exist various methods to produce pseudo data labels, they are often task-specific and require a decent amount of labeled data to start with. Recently, the immense language model GPT-3 with 175 billion parameters has achieved tremendous improvement across many few-shot learning tasks. In this paper, we explore ways to leverage GPT-3 as a low-cost data labeler to train other models. We find that, to make the downstream model achieve the same performance on a variety of NLU and NLG tasks, it costs 50% to 96% less to use labels from GPT-3 than to use labels from humans. Furthermore, we propose a novel framework of combining pseudo labels from GPT-3 with human labels, which leads to even better performance with a limited labeling budget. These results present a cost-effective data labeling methodology that is generalizable to many practical applications.
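To make the budget trade-off concrete, here is a minimal sketch of how a fixed labeling budget might be split between human and GPT-3 labels, and how human effort might be directed at the GPT-3 labels that look least reliable. It is an illustration under stated assumptions, not the paper's implementation: the function names, the per-label prices, and the use of a per-example confidence score are all hypothetical, and prices are given in integer cents to keep the arithmetic exact.

```python
from dataclasses import dataclass


@dataclass
class LabelPlan:
    n_human: int  # number of labels bought from human annotators
    n_gpt3: int   # number of labels bought from GPT-3


def plan_budget(budget_cents, human_cost_cents, gpt3_cost_cents, human_share):
    """Split a fixed budget between human and GPT-3 labels.

    human_share is the fraction of the budget spent on human labels.
    All prices are per label, in integer cents, and are hypothetical inputs.
    """
    if not 0.0 <= human_share <= 1.0:
        raise ValueError("human_share must be in [0, 1]")
    if human_cost_cents <= 0 or gpt3_cost_cents <= 0:
        raise ValueError("per-label costs must be positive")
    human_budget = round(budget_cents * human_share)
    gpt3_budget = budget_cents - human_budget
    return LabelPlan(
        n_human=human_budget // human_cost_cents,
        n_gpt3=gpt3_budget // gpt3_cost_cents,
    )


def select_for_human_relabel(confidences, k):
    """Return indices of the k examples with the lowest confidence.

    Assumes each GPT-3 label comes with a confidence score
    (an assumption of this sketch); ties keep the earlier index first.
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return order[:k]


def merge_labels(gpt3_labels, human_labels):
    """Overwrite GPT-3 labels with human labels where the latter exist.

    human_labels maps example index -> human-provided label.
    """
    return [human_labels.get(i, label) for i, label in enumerate(gpt3_labels)]


if __name__ == "__main__":
    # Hypothetical prices: 50 cents per human label, 5 cents per GPT-3 label.
    plan = plan_budget(10_000, human_cost_cents=50, gpt3_cost_cents=5,
                       human_share=0.5)
    print(plan)

    gpt3_labels = ["pos", "neg", "pos", "neg"]
    confidences = [0.9, 0.4, 0.7, 0.2]
    to_review = select_for_human_relabel(confidences, k=2)
    # In practice the reviewed labels would come from annotators.
    human_labels = {i: "pos" for i in to_review}
    print(merge_labels(gpt3_labels, human_labels))
```

With these placeholder prices, the same spend buys ten times as many GPT-3 labels as human labels, which is the kind of gap that makes a mixed strategy worth considering when GPT-3 labels are noisier than human ones.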
