New Money: A Systematic Review of Synthetic Data Generation for Finance

James Meldrum, Basem Suleiman, Fethi Rabhi, Muhammad Johan Alibasa

TL;DR

This systematic review maps the current state of synthetic data generation for finance, analysing 72 studies published since 2018. It finds that GAN-based methods are the dominant approach, especially for time-series market data, where TimeGAN is frequently used. A major finding is that formal privacy-preservation evaluations are underrepresented, despite privacy being a central concern in finance. The paper provides a structured synthesis of techniques, applications, and evaluation practices, and outlines gaps and priorities for advancing robust, privacy-preserving synthetic financial datasets.

Abstract

Synthetic data generation has emerged as a promising approach to address the challenges of using sensitive financial data in machine learning applications. By leveraging generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), it is possible to create artificial datasets that preserve the statistical properties of real financial records while mitigating privacy risks and regulatory constraints. Despite the rapid growth of this field, a comprehensive synthesis of the current research landscape has been lacking. This systematic review consolidates and analyses 72 studies published since 2018 that focus on synthetic financial data generation. We categorise the types of financial information synthesised, the generative methods employed, and the evaluation strategies used to assess data utility and privacy. The findings indicate that GAN-based approaches dominate the literature, particularly for generating time-series market data and tabular credit data. While several innovative techniques demonstrate potential for improved realism and privacy preservation, there remains a notable lack of rigorous evaluation of privacy safeguards across studies. By providing an integrated overview of generative techniques, applications, and evaluation methods, this review highlights critical research gaps and offers guidance for future work aimed at developing robust, privacy-preserving synthetic data solutions for the financial domain.

Paper Structure

This paper contains 30 sections, 5 figures, 21 tables.

Figures (5)

  • Figure 1: Overview of the systematic review process.
  • Figure 2: PRISMA flow diagram.
  • Figure 3: Number of studies collected by published year.
  • Figure 4: Sunburst chart of generative models used in collected studies. The outer layer shows the specific methods (with the number of studies in parentheses); the inner layer shows the overarching architectures.
  • Figure 5: Example of a t-SNE plot comparing the performance of two generative techniques on stock market data (Yoon et al., 2019).