Comprehensive Exploration of Synthetic Data Generation: A Survey

André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, Ian Foster

TL;DR

This survey compiles 417 synthetic data generation (SDG) models from the past decade to map the SDG landscape, creating a taxonomy of 20 model types and 42 subtypes. It highlights a trajectory toward neural-network-based approaches, with GANs and diffusion models driving image data synthesis and RNNs and transformers underpinning sequential data tasks, while privacy-preserving SDG remains nascent and evaluation protocols remain inconsistent. The authors provide a practical guideline for model selection and identify key gaps, including standardized evaluation metrics, shared datasets, and explicit cost reporting. The work aims to serve researchers and practitioners by clarifying model capabilities, limitations, and trade-offs across domains, thereby supporting informed SDG model choice and pointing to future research directions.

Abstract

Recent years have witnessed a surge in the popularity of Machine Learning (ML), applied across diverse domains. However, progress is impeded by the scarcity of training data due to expensive acquisition and privacy legislation. Synthetic data emerges as a solution, but the abundance of released models and limited overview literature pose challenges for decision-making. This work surveys 417 Synthetic Data Generation (SDG) models over the last decade, providing a comprehensive overview of model types, functionality, and improvements. Common attributes are identified, leading to a classification and trend analysis. The findings reveal increased model performance and complexity, with neural network-based approaches prevailing, except for privacy-preserving data generation. Computer vision dominates, with GANs as primary generative models, while diffusion models, transformers, and RNNs compete. Implications from our performance evaluation highlight the scarcity of common metrics and datasets, making comparisons challenging. Additionally, the neglect of training and computational costs in literature necessitates attention in future research. This work serves as a guide for SDG model selection and identifies crucial areas for future exploration.

Paper Structure

This paper contains 63 sections, 35 equations, 59 figures, 1 table, 2 algorithms.

Figures (59)

  • Figure 1: Illustrations of GMM used for one- and two-dimensional data.
  • Figure 2: Comparison between a Gaussian, GMM and a deep GMM with transformation biases $b_{i,j}$ not shown. (Source: oord2014factoring)
  • Figure 3: Illustrations of graphs of Markov chains with symbols as nodes and transition probabilities as edges. (Adapted from: durbin1998biological)
  • Figure 4: Illustrations of HMM.
  • Figure 5: Example of a BN with discrete random variables. (Source: guo2002survey)
  • ...and 54 more figures