Balancing Label Quantity and Quality for Scalable Elicitation

Alex Mallen, Nora Belrose

TL;DR

We explore the microeconomics of the quantity-quality tradeoff on the binary NLP classification tasks used in Burns et al. (2023), and find that the accuracy of supervised fine-tuning can be improved by up to 5 percentage points at a fixed labeling budget by adding a few-shot prompt that makes use of the model's existing knowledge of the task.

Abstract

Scalable oversight studies methods of training and evaluating AI systems in domains where human judgment is unreliable or expensive, such as scientific research and software engineering in complex codebases. Most work in this area has focused on methods of improving the quality of labels. Recent work by Burns et al. (2023) considers the complementary problem of training models with low-quality labels, finding that large pretrained models often have an inductive bias towards producing correct answers. In practice, however, neither label quantity nor quality is fixed: practitioners face a quantity-quality tradeoff. In this paper, we explore the microeconomics of the quantity-quality tradeoff on binary NLP classification tasks used in Burns et al. (2023). While sample-efficient learning has been studied extensively, little public research has focused on scalable elicitation: eliciting capabilities from pretrained models subject to labeling cost constraints. We find that this setting has novel dynamics caused by the tradeoff between label quantity and quality, as well as the model's existing latent capabilities. We observe three regimes of eliciting classification knowledge from pretrained models using supervised finetuning: quantity-dominant, quality-dominant, and a mixed regime involving the use of low- and high-quality data together to attain higher accuracy at a lower cost than using either alone. We explore sample-efficient elicitation methods that make use of two datasets of differing qualities, and establish a Pareto frontier of scalable elicitation methods that optimally trade off labeling cost and classifier performance. We find that the accuracy of supervised fine-tuning can be improved by up to 5 percentage points at a fixed labeling budget by adding a few-shot prompt to make use of the model's existing knowledge of the task.
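
The Pareto frontier mentioned in the abstract can be illustrated with a minimal sketch. It assumes each elicitation method is summarized as a (labeling cost, accuracy) pair; the function name `pareto_frontier` and the example points are hypothetical and are not results from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (labeling cost, accuracy)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points not dominated by any other point.

    A point is dominated if another point costs no more and is
    strictly more accurate, or costs strictly less and is at least
    as accurate.
    """
    frontier: List[Point] = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; at equal cost, visit the most accurate first.
    for cost, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        # Keep a point only if it beats every cheaper-or-equal point.
        if accuracy > best_accuracy:
            frontier.append((cost, accuracy))
            best_accuracy = accuracy
    return frontier


# Hypothetical (cost, accuracy) pairs, for illustration only.
example_points = [(16, 0.60), (64, 0.70), (64, 0.65), (256, 0.68), (1024, 0.80)]
example_frontier = pareto_frontier(example_points)
```

In this made-up example, the points at cost 64 with accuracy 0.65 and at cost 256 with accuracy 0.68 are dropped, because a cheaper or equally priced point is more accurate.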

Paper Structure

This paper contains 14 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Illustration of the tradeoff between quantity and quality of labels for sequential SFT. We arbitrarily define the cost of a high-quality label to be $1 and the cost of weak labels to be $0.10. Points lying on the y-axis can be understood as the accuracy attained when finetuning exclusively on high-quality labels for each budget. Along the x-axis, one high-quality label is given up for every 10 weak labels used. Weak labels are generated by Qwen-1.5 0.5B, and the strong model, Llama-3 8B, is sequentially trained on weak then high-quality labels. Results are averaged over 5 binary classification tasks (Hellaswag, SciQ, CosmosQA, Quail, and SocialIQA). Missing points from the curves with the highest budgets are due to some datasets not having enough examples to fill the train splits. Note that weak label accuracy is measured on the train set, which is not necessarily distributed identically to test. We see each of the three regimes. Quality-dominant (budget ≥ $1024): No budget should be allocated to weak labels. Quantity-dominant (budget ≤ $64): All budget should be allocated to weak labels. Mixed ($256 ≤ budget < $1024): The peak of the accuracy curve is somewhere in the middle.
  • Figure 2: Comparison between training on weak labels generated by Qwen-1.5 0.5B vs Qwen-1.5 4B at a weak marginal cost of $0.10.
  • Figure 3: Scaling trends of sequential SFT on MMLU (without early-stopping as described in the scaling section). Weak labels are 70.2% accurate and generated by davinci-002, which is less capable than Llama-3-8B. Weak labels are again assumed to cost 10 times less than high-quality labels. Error bars are standard deviations over random seeds. We use 3 random seeds, except for training runs where the smaller stage takes at most 10 examples, in which case we use 7 random seeds. We see weak evidence corroborating prior work that suggests larger models require fewer finetuning examples to elicit their knowledge (Zhang et al., 2024). High accuracy in MMLU can be elicited from GPT-4o-mini even with 16 finetuning examples.
  • Figure 4: Few-shot-prompted SFT with various quantities of weak and high-quality labels in-context and used for SFT. The quality of in-context examples is inconsequential, while the quality of SFT examples matters substantially.
  • Figure 5: Accuracy vs cost of the top three finetuning methods, at three different weak label costs, with weak labels generated by Qwen-1.5 0.5B. Each point is the average accuracy over Hellaswag, SocialIQA, and CosmosQA. The color indicates the fraction of labels that are weak, with black indicating that exactly zero high-quality labels were used. The Pareto frontier is shown in gray. 2-shot-prompted sequential SFT makes sample-efficient use of labels, making it the most effective method for most budgets. For low budgets, few-shot prompting with weak labels is most effective.
  • ...and 3 more figures
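
The cost accounting in the Figure 1 caption ($1 per high-quality label, $0.10 per weak label) can be sketched as follows. This is only an illustration of the arithmetic, not the authors' code; the function names are hypothetical, and prices are held in integer cents to avoid floating-point rounding.

```python
from typing import Iterator, Tuple

HIGH_QUALITY_COST_CENTS = 100  # $1.00 per high-quality label (Figure 1 caption)
WEAK_COST_CENTS = 10           # $0.10 per weak label (Figure 1 caption)


def high_quality_labels_affordable(budget_cents: int, n_weak: int) -> int:
    """Number of high-quality labels left after buying n_weak weak labels."""
    spent_on_weak = n_weak * WEAK_COST_CENTS
    if spent_on_weak > budget_cents:
        raise ValueError("weak labels alone exceed the budget")
    return (budget_cents - spent_on_weak) // HIGH_QUALITY_COST_CENTS


def allocations(budget_cents: int, weak_step: int = 10) -> Iterator[Tuple[int, int]]:
    """Yield (n_weak, n_high_quality) splits of a fixed budget.

    With the default prices, each step of 10 weak labels gives up
    exactly one high-quality label.
    """
    max_weak = budget_cents // WEAK_COST_CENTS
    for n_weak in range(0, max_weak + 1, weak_step):
        yield n_weak, high_quality_labels_affordable(budget_cents, n_weak)


# A $64 budget: from 64 high-quality labels down to 640 weak labels.
splits = list(allocations(64 * 100))
```

Each tuple in `splits` is one candidate allocation at the same total cost. Which allocation gives the best accuracy is an empirical question that the figure answers for each budget.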