Synthetic Data Applications in Finance

Vamsi K. Potluru, Daniel Borrajo, Andrea Coletta, Niccolò Dalmasso, Yousef El-Laham, Elizabeth Fons, Mohsen Ghassemi, Sriram Gopalakrishnan, Vikesh Gosai, Eleonora Kreačić, Ganapathy Mani, Saheed Obitayo, Deepak Paramanand, Natraj Raman, Mikhail Solonin, Srijan Sood, Svitlana Vyetrenko, Haibei Zhu, Manuela Veloso, Tucker Balch

TL;DR

This paper surveys synthetic data applications in finance across multiple data modalities, including tabular, time-series, event-series, and unstructured formats, with an emphasis on regulatory and privacy considerations. It reviews generation techniques, from model-based simulators such as ABIDES to neural generators such as CTGAN and TimeGAN, and proposes a privacy-level framework to guide safe deployment. It highlights metrics for fidelity, utility, and privacy, and discusses data liberation, augmentation, and counterfactual testing as core use cases, illustrated by case studies in fraud detection, marketing journeys, and market simulation. It concludes with open challenges and directions, underscoring the potential of synthetic data to enable robust testing, safer data sharing, and improved decision-making in finance while acknowledging regulatory, ethical, and practical hurdles.

Abstract

Synthetic data has made tremendous strides in various commercial settings including finance, healthcare, and virtual reality. We present a broad overview of prototypical applications of synthetic data in the financial sector and in particular provide richer details for a few select ones. These cover a wide variety of data modalities, including tabular, time-series, event-series, and unstructured data, arising from both markets and retail financial applications. Since finance is a highly regulated industry, synthetic data is a potential approach for dealing with issues related to privacy, fairness, and explainability. Various metrics are utilized in evaluating the quality and effectiveness of our approaches in these applications. We conclude with open directions in synthetic data in the context of the financial domain.

Paper Structure (75 sections, 1 equation, 19 figures, 13 tables)

Figures (19)

  • Figure 1: (Left) A Markov model in RDDL [sanner2010relational]. (Right) A multi-agent market simulator [byrd2019abides].
  • Figure 2: Privacy Level 1: Obscure PII
  • Figure 3: Privacy Level 2: Obscure PII + noise
  • Figure 4: Privacy Level 3: Generative modeling. The question mark suggests the possibility of reverse-engineering the data.
  • Figure 5: Privacy Level 4: Generative modeling + testing
  • ...and 14 more figures