
DataGen: Unified Synthetic Dataset Generation via Large Language Models

Yue Huang, Siyuan Wu, Chujie Gao, Dongping Chen, Qihui Zhang, Yao Wan, Tianyi Zhou, Jianfeng Gao, Chaowei Xiao, Lichao Sun, Xiangliang Zhang

TL;DR

DataGen is a unified, LLM-powered framework for generating high-quality synthetic textual datasets, designed to address generalization, controllability, diversity, and truthfulness. It combines input specification, attribute-guided generation, self-evaluation, code-based verification, RAG-based truthfulness checks, and post-processing to produce controllable, diverse data. The framework is validated across multiple datasets and LLMs, demonstrating improvements in data quality and diversity while enabling practical applications in LLM benchmarking and data augmentation. The results provide insights into module contributions, cost efficiency, and the potential of synthetic data to improve model capabilities, with limitations and future directions discussed.
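
The TL;DR above lists the pipeline stages in order. As a rough illustration only, the sketch below wires hypothetical stand-ins for those stages together; the function names, the `GenerationSpec` fields, and the stage logic are assumptions made for this example, not the paper's actual API or implementation.

```python
# Hypothetical sketch of a DataGen-style pipeline; the stage functions below are
# illustrative stand-ins, not the paper's implementation.
from dataclasses import dataclass, field


@dataclass
class GenerationSpec:
    """User-supplied input specification: seed examples plus constraints."""
    seed_examples: list[str]
    constraints: list[str] = field(default_factory=list)
    attributes: list[str] = field(default_factory=list)  # e.g. topic, difficulty


def attribute_guided_generation(spec: GenerationSpec) -> list[str]:
    # In a real system an LLM would be prompted with the seed examples,
    # constraints, and target attributes; here we just tag and echo the seeds.
    return [
        f"[{attr}] {example}"
        for example in spec.seed_examples
        for attr in (spec.attributes or ["generic"])
    ]


def self_evaluate(items: list[str]) -> list[str]:
    # Placeholder for an LLM self-assessment pass that filters low-quality items.
    return [item for item in items if len(item.split()) > 2]


def verify(items: list[str]) -> list[str]:
    # Placeholder for code-based label checks and RAG-based truthfulness checks.
    return items


def post_process(items: list[str]) -> list[str]:
    # Deduplicate while preserving order (a simple stand-in for group checking).
    return list(dict.fromkeys(items))


def generate_dataset(spec: GenerationSpec) -> list[str]:
    return post_process(verify(self_evaluate(attribute_guided_generation(spec))))


if __name__ == "__main__":
    spec = GenerationSpec(
        seed_examples=["What is 2 + 2?", "Name the capital of France."],
        attributes=["easy", "hard"],
    )
    for item in generate_dataset(spec):
        print(item)
```

In the framework described by the paper, each placeholder would be backed by LLM calls (for generation and self-evaluation) and by external checks (code execution and retrieval) rather than the trivial logic shown here.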

Abstract

Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Despite this, challenges remain in the areas of generalization, controllability, diversity, and truthfulness within the existing generative frameworks. To address these challenges, this paper presents DataGen, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. DataGen is adaptable, supporting all types of text datasets and enhancing the generative process through innovative mechanisms. To augment data diversity, DataGen incorporates an attribute-guided generation module and a group checking feature. For accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented generation technique for factual validation. The framework also allows for user-specified constraints, enabling customization of the data generation process to suit particular requirements. Extensive experiments demonstrate the superior quality of data generated by DataGen, and each module within DataGen plays a critical role in this enhancement. Additionally, DataGen is applied in two practical scenarios: benchmarking LLMs and data augmentation. The results indicate that DataGen effectively supports dynamic and evolving benchmarking and that data augmentation improves LLM capabilities in various domains, including agent-oriented abilities and reasoning skills.
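
The abstract's "code-based mathematical assessment for label verification" suggests checking generated labels by executing code rather than trusting the model's own answer. The following is a minimal sketch of that idea under stated assumptions: the verifier receives a short solution program (normally produced by an LLM, here hard-coded) and compares its output to the candidate label. `verify_label` and the item fields are hypothetical names chosen for illustration, not the paper's interface.

```python
# Hypothetical sketch of code-based label verification: execute a generated
# solution program and compare its result with the dataset label.
# The solver code here is hard-coded; in a DataGen-style pipeline it would be
# produced by an LLM (an assumption about the mechanism, not the paper's exact procedure).

def verify_label(solver_code: str, expected_label: str) -> bool:
    """Run solver_code in an isolated namespace and compare its `answer` to the label."""
    namespace: dict = {}
    try:
        exec(solver_code, {"__builtins__": {}}, namespace)  # no builtins: a crude sandbox
    except Exception:
        return False  # non-runnable code cannot confirm the label
    return str(namespace.get("answer")) == expected_label


if __name__ == "__main__":
    item = {
        "question": "A train travels 60 km in 1.5 hours. What is its average speed in km/h?",
        "label": "40.0",
        # Stand-in for LLM-generated verification code:
        "solver_code": "answer = 60 / 1.5",
    }
    print("label verified:", verify_label(item["solver_code"], item["label"]))
```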


Paper Structure

This paper contains 34 sections, 1 equation, 14 figures, 14 tables.

Figures (14)

  • Figure 1: Our proposed DataGen for dataset generation via LLMs.
  • Figure 2: The architecture of DataGen.
  • Figure 3: Human evaluation of overall quality assessment and enhancement.
  • Figure 4: Length and the self-BLEU score of generated data and original data.
  • Figure 5: Semantic embedding of different datasets. We use OpenAI's text-embedding-ada-002 to obtain text embeddings.
  • ...and 9 more figures