Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions

John Joon Young Chung, Ece Kamar, Saleema Amershi

TL;DR

The paper tackles the challenge of generating high-quality, diverse text datasets for downstream classification by combining large language model generation with human-in-the-loop interventions. It evaluates two diversification strategies, logit suppression and high-temperature sampling, and finds that they increase diversity but often reduce data accuracy, meaning that the generated text or its label no longer fits the target domain. To mitigate this, it examines two human interventions: label replacement (LR), which corrects misaligned labels, and out-of-scope filtering (OOSF), which removes instances that fall outside the domain of interest or to which no considered label applies. In oracle studies, LR raises the absolute accuracy of models trained on diversified datasets by 14.4%, and some models trained on LR-corrected data outperform GPT-3 few-shot classification. OOSF, in contrast, does not improve model accuracy, which points to the need for further work on human-in-the-loop text data generation.

Abstract

Large language models (LLMs) can be used to generate text data for training and evaluating other models. However, creating high-quality datasets with LLMs can be challenging. In this work, we explore human-AI partnerships to facilitate high diversity and accuracy in LLM-based text data generation. We first examine two approaches to diversify text generation: 1) logit suppression, which reduces the generation of language that has already been generated frequently, and 2) temperature sampling, which flattens the token sampling probability. We found that diversification approaches can increase data diversity but often at the cost of data accuracy (i.e., text and labels being appropriate for the target domain). To address this issue, we examined two human interventions: 1) label replacement (LR), correcting misaligned labels, and 2) out-of-scope filtering (OOSF), removing instances that are out of the user's domain of interest or to which no considered label applies. With oracle studies, we found that LR increases the absolute accuracy of models trained with diversified datasets by 14.4%. Moreover, we found that some models trained with data generated with LR interventions outperformed LLM-based few-shot classification. In contrast, OOSF was not effective in increasing model accuracy, implying the need for future work in human-in-the-loop text data generation.
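
To make the two diversification approaches concrete, the following is a minimal Python sketch over a toy vocabulary. It is an illustration under stated assumptions, not the paper's implementation: the function names (`softmax`, `suppress_logits`, `sample_token`), the `strength` parameter, and the linear frequency penalty are all hypothetical choices made here. Temperature sampling divides the logits by a temperature before the softmax, so values above 1 flatten the distribution; logit suppression lowers the logit of each token according to how often it has already been generated.

```python
import math
import random
from collections import Counter


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature > 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def suppress_logits(logits, vocab, counts, strength=1.0):
    """Lower each token's logit in proportion to its share of past generations.

    The linear penalty is an assumption made for this sketch.
    """
    total = sum(counts.values())
    if total == 0:
        return list(logits)
    return [x - strength * counts.get(tok, 0) / total
            for x, tok in zip(logits, vocab)]


def sample_token(logits, vocab, counts, temperature=1.0, strength=0.0, rng=random):
    """Sample one token after applying suppression and temperature scaling."""
    adjusted = suppress_logits(logits, vocab, counts, strength)
    probs = softmax(adjusted, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


if __name__ == "__main__":
    vocab = ["good", "great", "fine"]
    logits = [2.0, 1.0, 0.5]
    seen = Counter({"good": 8, "great": 2})  # tokens generated so far
    print(softmax(logits, temperature=1.0))
    print(softmax(logits, temperature=2.0))  # flatter than the line above
    print(suppress_logits(logits, vocab, seen, strength=5.0))
    print(sample_token(logits, vocab, seen, temperature=1.3, strength=5.0))
```

Both knobs push probability mass toward tokens the model would otherwise rarely pick, which is consistent with the trade-off described in the abstract: more diverse text, but a higher chance that the text or its label drifts from the target domain.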

Paper Structure (41 sections, 5 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Examples of Diversification Approaches.
  • Figure 2: Impact of logit suppression and high temperatures on model accuracy, label accuracy, diversity, and similarity to the oracle dataset, averaged across eight tasks. Bars without hatches start generation without examples, while those with hatches start with few-shot generation. Throughout this paper, error bars indicate 95% confidence intervals.
  • Figure 3: Impact of label replacement on label accuracy and model accuracy. Throughout this paper, error areas indicate 95% confidence intervals.
  • Figure 4: The ratio of instances filtered with OOSF, and its impact on model accuracy, label accuracy, diversity, and similarity, aggregated across all tasks. Because the effect of OOSF was examined together with LR, the numbers to the left of +OOS for model accuracy and label accuracy indicate how many instances were inspected with LR.
  • Figure 5: Impact of logit suppression and high temperatures on model accuracy, label accuracy, diversity, and similarity to the oracle dataset, for each task.
  • ...and 5 more figures
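
The two human interventions that Figures 3 and 4 evaluate, label replacement (LR) and out-of-scope filtering (OOSF), can be sketched as a single pass over generated instances. This is a hedged illustration only: the `Instance` class, the `apply_interventions` function, the `budget` parameter, and the keyword-based `toy_oracle` are hypothetical names introduced here, and the oracle stands in for a human annotator or an oracle model. The sketch assumes the oracle returns the correct label for in-scope text and `None` for out-of-scope text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Instance:
    text: str
    label: str


def apply_interventions(
    instances: List[Instance],
    oracle: Callable[[str], Optional[str]],
    budget: int,
    filter_out_of_scope: bool = False,
) -> List[Instance]:
    """Inspect the first `budget` instances; the rest pass through unchanged.

    LR: an inspected instance takes the oracle's label.
    OOSF: an inspected instance the oracle marks out of scope (None) is dropped,
    but only when `filter_out_of_scope` is True.
    """
    result = []
    for index, inst in enumerate(instances):
        if index >= budget:
            result.append(inst)  # not inspected, kept as generated
            continue
        verdict = oracle(inst.text)
        if verdict is None:
            if not filter_out_of_scope:
                result.append(inst)  # out of scope but filtering disabled
            continue
        result.append(Instance(inst.text, verdict))  # label replacement
    return result


def toy_oracle(text: str) -> Optional[str]:
    """Hypothetical keyword oracle for a two-class sentiment task."""
    lowered = text.lower()
    if "loved" in lowered or "great" in lowered:
        return "positive"
    if "terrible" in lowered:
        return "negative"
    return None  # no considered label applies


if __name__ == "__main__":
    generated = [
        Instance("I loved this film", "negative"),
        Instance("The weather is cloudy", "positive"),
        Instance("Terrible plot", "negative"),
        Instance("Great acting", "negative"),
    ]
    for inst in apply_interventions(generated, toy_oracle, budget=3,
                                    filter_out_of_scope=True):
        print(inst)
```

With `budget=3`, the last instance is never inspected and keeps its generated label, which mirrors the idea of varying how many instances are inspected with LR.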