A Survey on Data Synthesis and Augmentation for Large Language Models

Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang

TL;DR

This survey addresses the data bottleneck in scaling LLMs by reviewing data synthesis and augmentation techniques spanning the full lifecycle of LLMs and across core functionalities like understanding, reasoning, memory, and generation. It introduces a two-dimensional taxonomy separating augmentation from synthesis and maps methods to lifecycle stages (data preparation to applications) and functional goals (e.g., domain-specific distillation, self-improvement, and multimodal data). Key contributions include a comprehensive taxonomy, a lifecycle-oriented synthesis framework, and insights into challenges, evaluation, and future directions for data-centric LLM development. The findings highlight that combining general and domain-specific distillation with self-improvement and robust evaluation can meaningfully extend data efficiency, improve task generalization, and enable scalable, multi-domain LLM capabilities while underscoring ethical and security considerations. Overall, the work provides a structured blueprint for researchers to select data-generation strategies aligned with their modeling goals and deployment contexts.

Abstract

The success of Large Language Models (LLMs) is inherently linked to the availability of vast, diverse, and high-quality data for training and evaluation. However, the growth rate of high-quality data is significantly outpaced by the expansion of training datasets, leading to a looming data exhaustion crisis. This underscores the urgent need to enhance data efficiency and explore new data sources. In this context, synthetic data has emerged as a promising solution. Currently, data generation primarily consists of two major approaches: data augmentation and data synthesis. This paper comprehensively reviews and summarizes data generation techniques throughout the lifecycle of LLMs, including data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. Furthermore, we discuss the current constraints faced by these methods and investigate potential pathways for future development and research. Our aspiration is to equip researchers with a clear understanding of these methodologies, enabling them to swiftly identify appropriate data generation strategies in the construction of LLMs, while providing valuable insights for future exploration.

Paper Structure

This paper contains 50 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Statistics of the publications related to LLM-oriented data synthesis and augmentation technologies, grouped by the publication year and venue.
  • Figure 2: A comparison between existing surveys on data synthesis and augmentation techniques and our work. Previous surveys primarily focus on LLM-based data synthesis and augmentation methods aimed at supporting downstream tasks. In contrast, our work emphasizes LLM-oriented data synthesis and augmentation, systematically covering the full lifecycle of LLMs—from data preparation to applications—and addressing core LLM functions such as understanding and generation, with the ultimate goal of improving LLMs themselves through data-centric techniques.
  • Figure 3: The main content flow and categorization of this survey.
  • Figure 4: Illustration of the evolutionary steps in the development of data synthesis and augmentation techniques for large models.