TarGEN: Targeted Data Generation with Large Language Models

Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, Swaroop Mishra

TL;DR

TarGEN introduces a seedless, multi-step prompting framework for generating high-quality synthetic data with LLMs, augmented by a self-correction module to ensure accurate labels. By formulating data synthesis as label-constrained generation from task descriptions, TarGEN achieves diverse yet reliable datasets, evaluated on eight SuperGLUE tasks. Empirical results show models trained on TarGEN data can match or exceed performance with original data, especially when combined with instruction tuning and multi-task training, while analyses confirm increased diversity and comparable bias. The approach reduces human data curation effort and demonstrates strong potential for domain-specific benchmark generation and broader applicability across tasks and models.

Abstract

The rapid advancement of large language models (LLMs) has sparked interest in data synthesis techniques that aim to generate diverse, high-quality synthetic datasets. However, these synthetic datasets often suffer from a lack of diversity and added noise. In this paper, we present TarGEN, a multi-step prompting strategy for generating high-quality synthetic datasets using an LLM. An advantage of TarGEN is its seedless nature: it does not require specific task instances, which broadens its applicability beyond task replication. We augment TarGEN with self-correction, a method that empowers LLMs to rectify inaccurately labeled instances during dataset creation, ensuring reliable labels. To assess our technique's effectiveness, we emulate 8 tasks from the SuperGLUE benchmark and finetune various language models, including encoder-only, encoder-decoder, and decoder-only models, on both synthetic and original training sets. Evaluation on the original test set reveals that models trained on datasets generated by TarGEN perform approximately 1-2 percentage points better than those trained on original datasets (82.84% with synthetic data vs. 81.12% with original data, using Flan-T5). When incorporating instruction tuning, performance increases to 84.54% on synthetic data vs. 81.49% on original data with Flan-T5. A comprehensive analysis of the synthetic dataset compared to the original dataset reveals that the synthetic dataset demonstrates similar or higher levels of dataset complexity and diversity. Furthermore, the synthetic dataset displays a bias level that aligns closely with that of the original dataset. Finally, when pre-finetuned on our synthetic SuperGLUE dataset, T5-3B yields impressive results on the OpenLLM leaderboard, surpassing the model trained on the Self-Instruct dataset by 4.14 percentage points. We hope that TarGEN can be helpful for quality data generation and for reducing the human effort needed to create complex benchmarks.
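
The multi-step strategy described in the abstract (seed generation, label-constrained instance generation, then self-correction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable, every function name, and all prompt wordings are hypothetical, and `fake_llm` is a canned stand-in so the sketch runs without any model.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a function that maps a prompt to a completion.
LLM = Callable[[str], str]


def generate_seeds(llm: LLM, task: str, n: int) -> List[str]:
    """Step 1: ask for instance seeds (prompt wording is an assumption)."""
    reply = llm(f"{task}\nSEEDS: list {n} instance seeds, one per line.")
    lines = [line.strip() for line in reply.splitlines()]
    return [line for line in lines if line][:n]


def generate_instance(llm: LLM, task: str, seed: str, label: str) -> Dict[str, str]:
    """Step 2: label-constrained generation from a seed and a target label."""
    text = llm(f"{task}\nGENERATE: seed={seed}; write an instance with label '{label}'.")
    return {"text": text.strip(), "label": label}


def self_correct(llm: LLM, task: str, instance: Dict[str, str], labels: List[str]) -> Dict[str, str]:
    """Step 3: ask the model to re-check the label; keep the old one if the reply is unusable."""
    reply = llm(
        f"{task}\nVERIFY: instance={instance['text']}; proposed label={instance['label']}; "
        f"answer with one of {labels}."
    ).strip()
    return {**instance, "label": reply if reply in labels else instance["label"]}


def targen_sketch(llm: LLM, task: str, labels: List[str], n_seeds: int) -> List[Dict[str, str]]:
    """Run the three steps for every (seed, label) pair."""
    dataset = []
    for seed in generate_seeds(llm, task, n_seeds):
        for label in labels:
            instance = generate_instance(llm, task, seed, label)
            dataset.append(self_correct(llm, task, instance, labels))
    return dataset


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a real model (made-up replies, for illustration only)."""
    if "SEEDS:" in prompt:
        return "bank\nbat\n"
    if "VERIFY:" in prompt:
        return "T"  # this stub always answers "T", so every "F" instance is re-labeled
    return "She sat on the bank. / He went to the bank."


TASK = "Decide whether a word is used in the same sense in two sentences (T or F)."
dataset = targen_sketch(fake_llm, TASK, ["T", "F"], n_seeds=2)
```

With a real model in place of `fake_llm`, the only task-specific inputs are the task description and the label set, which is what makes the approach seedless.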

Paper Structure (46 sections, 10 equations, 11 figures, 16 tables)

Figures (11)

  • Figure 1: An overview of using TarGEN to generate instances for the WiC task. We first create a set of prompts (boxes 1, 2 in figure) to generate instance seeds, or linguistic components unique to each task instance. Next, we create label-specific prompts (box 3) that generate instances based on instance seeds and the relationship implied by the label for this task. We use zero-shot LLM inference to generate an initial set of synthetic instances. The instances are then passed to our self-correction module consisting of a single meta-prompt which allows us to re-label mislabeled data instances, helping us reduce noise. Hence, based on the task description, we obtain high-quality synthetic instances to evaluate a task.
  • Figure 2: Matrices showing the effect of the self-correction step across various SuperGLUE datasets. Each row corresponds to the label originally assigned to an instance (ent, non: entailment, non-entailment; neutr, contr: neutral, contradiction). The number in cell $(i,j)$ is the number of instances originally assigned label $i$ that were re-labeled to label $j$ after self-correction. While the majority of instances had their labels reaffirmed by self-correction, a significant number were re-labeled as a result of this step.
  • Figure 3: Comparison of semantic diversity between the original and the synthetically generated datasets. For most tasks, the original datasets have a higher cosine similarity than the synthetic datasets, whose consistently lower cosine similarity indicates higher semantic diversity.
  • Figure 4: Comparison of dataset bias for the BoolQ dataset and the synthetically generated BoolQ dataset. The plot shows the distribution for the GPE (Geopolitical Entity) named-entity tag. The distributions for other entities can be found in § \ref{sec:extended_analysis_app}.
  • Figure 5: Comparison of PVI (pointwise $\mathcal{V}$-usable information) for the original and the synthetically generated AXG, BoolQ, and WiC datasets. The synthetic data appears to be of better quality: the original datasets' PVI is concentrated around -0.1 to 0.1, whereas the synthetic data contains a diverse mix of difficulty levels among its samples. The dataset difficulty comparison for all datasets can be found in § \ref{fig:data_diff_comparison_full}.
  • ...and 6 more figures
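
The re-labeling matrices described in the Figure 2 caption can be computed from two parallel label lists, one before and one after self-correction. The sketch below is illustrative only: the function name `relabel_matrix` is hypothetical and the toy labels are made up, not results from the paper.

```python
from collections import Counter
from typing import List, Sequence


def relabel_matrix(before: Sequence[str], after: Sequence[str], labels: Sequence[str]) -> List[List[int]]:
    """Cell (i, j) counts instances originally labeled labels[i] that ended up as labels[j]."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    counts = Counter(zip(before, after))
    return [[counts[(i, j)] for j in labels] for i in labels]


# Toy labels (made up for illustration).
before = ["ent", "ent", "non", "non", "non"]
after = ["ent", "non", "non", "non", "ent"]
matrix = relabel_matrix(before, after, ["ent", "non"])
print(matrix)  # [[1, 1], [1, 2]]
```

Diagonal cells count instances whose label was reaffirmed, and off-diagonal cells count instances that were re-labeled.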