Advancing Retail Data Science: Comprehensive Evaluation of Synthetic Data

Yu Xia, Chi-Hua Wang, Joshua Mabry, Guang Cheng

TL;DR

This work tackles the challenge of evaluating synthetic retail data by proposing a model-free, data-centric framework that jointly assesses fidelity, utility, and privacy. The approach uses a Train–Holdout–Eval split: a synthetic dataset $S$ is generated from the training data $T$ and then compared against a holdout set $H$ and an evaluation set $E$, using metrics on marginal distributions, joint dependencies, and downstream predictive tasks, plus $DCR$ (distance to the closest record) and $CCR$ for privacy. Empirical results on the Complete Journey dataset show that TabAutoDiff excels in utility and privacy, while CTGAN offers strong fidelity in joint distributions; product association remains hard to capture for all models. The framework provides a scalable benchmark for retail synthetic data, enabling safer data sharing and rapid testing of pricing, forecasting, and customer analytics, while guiding future methodological improvements and domain-specific metric development.
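
To make the split and the privacy check concrete, here is a minimal, dependency-free sketch. It is not the authors' implementation: the split fractions, the Euclidean distance, the helper names (`split_train_holdout_eval`, `dcr`, `share_closer_to_train`), and the toy Gaussian data are all assumptions for illustration, and $CCR$ is not covered. The idea is that a synthetic row that sits much closer to $T$ than to $H$ hints at memorization, so a share near 0.5 is the desirable outcome when $T$ and $H$ have the same size.

```python
import math
import random


def split_train_holdout_eval(rows, frac_train=0.4, frac_holdout=0.4, seed=0):
    """Shuffle rows and cut them into train (T), holdout (H), and eval (E).

    The fractions are illustrative assumptions; T and H are kept the same
    size so that nearest-neighbour comparisons between them are fair.
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * frac_train)
    n_holdout = int(len(shuffled) * frac_holdout)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_holdout],
        shuffled[n_train + n_holdout:],
    )


def dcr(record, reference):
    """Distance to closest record: smallest Euclidean distance to any reference row."""
    return min(math.dist(record, ref) for ref in reference)


def share_closer_to_train(synthetic, train, holdout):
    """Fraction of synthetic rows that are strictly closer to T than to H."""
    closer = sum(dcr(s, train) < dcr(s, holdout) for s in synthetic)
    return closer / len(synthetic)


# Toy numeric data standing in for preprocessed retail features (assumption).
rng = random.Random(42)
rows = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
T, H, E = split_train_holdout_eval(rows)  # E is reserved for downstream tasks

# A "generator" that simply copies training rows leaks them completely.
S_memorized = T[:200]
# A "generator" that draws fresh rows from the same distribution does not.
S_fresh = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

share_memorized = share_closer_to_train(S_memorized, T, H)
share_fresh = share_closer_to_train(S_fresh, T, H)
print(f"memorized: {share_memorized:.2f}, fresh: {share_fresh:.2f}")
```

Copied rows have a DCR of zero to $T$, so their share is 1.0, whereas fresh draws land near 0.5. Real retail tables mix numeric and categorical columns, so a practical version would need an encoding and a distance suited to mixed types.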

Abstract

The evaluation of synthetic data generation is crucial, especially in the retail sector where data accuracy is paramount. This paper introduces a comprehensive framework for assessing synthetic retail data, focusing on fidelity, utility, and privacy. Our approach differentiates between continuous and discrete data attributes, providing precise evaluation criteria. Fidelity is measured through stability and generalizability. Stability ensures synthetic data accurately replicates known data distributions, while generalizability confirms its robustness in novel scenarios. Utility is demonstrated through the synthetic data's effectiveness in critical retail tasks such as demand forecasting and dynamic pricing, proving its value in predictive analytics and strategic planning. Privacy is safeguarded using Differential Privacy, ensuring synthetic data maintains a perfect balance between resembling training and holdout datasets without compromising security. Our findings validate that this framework provides reliable and scalable evaluation for synthetic retail data. It ensures high fidelity, utility, and privacy, making it an essential tool for advancing retail data science. This framework meets the evolving needs of the retail industry with precision and confidence, paving the way for future advancements in synthetic data methodologies.
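
The stability and generalizability checks can be illustrated with a one-dimensional Wasserstein distance, which the framework diagram lists among the fidelity metrics. The sketch below is an assumption-laden illustration rather than the paper's procedure: it reads stability as the distance between synthetic and training marginals and generalizability as the distance between synthetic and holdout marginals, and the function name `wasserstein_1d` and the Gaussian toy columns are hypothetical.

```python
import bisect
import random


def wasserstein_1d(u, v):
    """Empirical 1-Wasserstein distance between two 1-D samples.

    Computed as the area between the two empirical CDFs, evaluated on the
    merged, sorted grid of all observed values.
    """
    u_sorted, v_sorted = sorted(u), sorted(v)
    grid = sorted(u_sorted + v_sorted)
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        u_cdf = bisect.bisect_right(u_sorted, left) / len(u_sorted)
        v_cdf = bisect.bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(u_cdf - v_cdf) * (right - left)
    return total


# Toy numeric column (assumption): train and holdout share one distribution.
rng = random.Random(7)
train = [rng.gauss(0.0, 1.0) for _ in range(1000)]
holdout = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Two hypothetical generators: one well calibrated, one with a shifted mean.
synth_good = [rng.gauss(0.0, 1.0) for _ in range(1000)]
synth_shifted = [rng.gauss(0.5, 1.0) for _ in range(1000)]

# Stability: synthetic vs. training. Generalizability: synthetic vs. holdout.
stability_good = wasserstein_1d(synth_good, train)
stability_shifted = wasserstein_1d(synth_shifted, train)
generalizability_good = wasserstein_1d(synth_good, holdout)
generalizability_shifted = wasserstein_1d(synth_shifted, holdout)

print(f"good:    stability={stability_good:.3f}, "
      f"generalizability={generalizability_good:.3f}")
print(f"shifted: stability={stability_shifted:.3f}, "
      f"generalizability={generalizability_shifted:.3f}")
```

Lower values mean closer marginals, so the shifted generator scores worse on both checks. Discrete attributes would need a different comparison, in line with the abstract's separate treatment of continuous and discrete data.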

Paper Structure (26 sections, 1 equation, 6 figures, 6 tables)

This paper contains 26 sections, 1 equation, 6 figures, 6 tables.

Figures (6)

  • Figure 1: The framework diagram of our synthetic retail data evaluation pipeline. Section \ref{subsec:data-split} explains why and how the transaction data are split. Section \ref{subsec:define-fidelity} defines the metrics for fidelity assessment, e.g. Wasserstein distance and Pearson correlation. Section \ref{subsec:define-utility} defines the tasks for utility assessment, e.g. classification accuracy and product association. Section \ref{subsec:define-privacy} explains the metrics for privacy assessment, e.g. distance to the closest record.
  • Figure 2: Selected univariate distributions for the “Complete Journey” dataset, illustrating the diverse distributional patterns encountered in real-world datasets (see Section \ref{paragraph:marginal-dist}).
  • Figure 3: Marginal distribution of basket size by household size. Customers with more family members in the household tend to buy more products in one visit (see Section \ref{paragraph:joint-dist}).
  • Figure 4: Correlation of selected numeric and categorical distributions for the “Complete Journey” dataset, illustrating contextual relationships observed in the real-world dataset (see Section \ref{paragraph:joint-dist-corr}).
  • Figure 5: Distribution of feature columns from the training, holdout, and synthetic datasets, together with each distribution's difference from the one observed in the training dataset. The figure contains a primitive numerical column (Quantity), a derived numerical column (Basket Size), and primitive categorical columns (Age, Household Size); see \ref{paragraph:marginal-dist-analysis}. Synthetic data generated by TabAutoDiff shows feature distributions that closely mirror those of the original training dataset.
  • ...and 1 more figure