Advancing Retail Data Science: Comprehensive Evaluation of Synthetic Data
Yu Xia, Chi-Hua Wang, Joshua Mabry, Guang Cheng
TL;DR
This work tackles the challenge of evaluating synthetic retail data by proposing a model-free, data-centric framework that jointly assesses fidelity, utility, and privacy. The approach uses a Train–Holdout–Eval split: a synthetic dataset $S$ is generated from training data $T$ and then compared against a holdout set $H$ and an evaluation set $E$ using metrics on distributions, joint dependencies, and downstream predictive tasks, with $DCR$ and $CCR$ used for privacy. Empirical results on the Complete Journey dataset show that TabAutoDiff excels in utility and privacy, while CTGAN offers strong fidelity in joint distributions; product association remains hard to capture for all models. The framework provides a scalable benchmark for retail synthetic data, enabling safer data sharing and rapid testing of pricing, forecasting, and customer analytics, while guiding future methodological improvements and domain-specific metric development.
Abstract
The evaluation of synthetic data generation is crucial, especially in the retail sector where data accuracy is paramount. This paper introduces a comprehensive framework for assessing synthetic retail data, focusing on fidelity, utility, and privacy. Our approach differentiates between continuous and discrete data attributes, providing precise evaluation criteria for each. Fidelity is measured through stability and generalizability: stability checks that synthetic data accurately replicates known data distributions, while generalizability checks that it remains robust in novel scenarios. Utility is demonstrated through the synthetic data's effectiveness in critical retail tasks such as demand forecasting and dynamic pricing, showing its value in predictive analytics and strategic planning. Privacy is safeguarded using Differential Privacy, so that synthetic data resembles the training and holdout datasets to a balanced degree without compromising security. Our findings indicate that this framework provides reliable and scalable evaluation of synthetic retail data across fidelity, utility, and privacy, making it a practical tool for advancing retail data science. The framework addresses the evolving needs of the retail industry and paves the way for future advancements in synthetic data methodologies.
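
To make the privacy criterion concrete, the following is a minimal sketch of a distance-to-closest-record style check of whether synthetic rows sit closer to the training data than to the holdout data. It is an illustration under stated assumptions, not the paper's implementation: the Euclidean distance, the function names `dcr` and `closer_to_train_share`, and the toy Gaussian data are all hypothetical choices made here, and the exact definitions of $DCR$ and $CCR$ used in the paper may differ.

```python
import numpy as np

def dcr(synthetic, reference):
    """Distance from each synthetic row to its closest reference row.

    Assumption: plain Euclidean distance on numeric features.
    """
    diffs = synthetic[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)

def closer_to_train_share(synthetic, train, holdout):
    """Fraction of synthetic rows strictly closer to train than to holdout."""
    return float(np.mean(dcr(synthetic, train) < dcr(synthetic, holdout)))

# Toy data (hypothetical): train and holdout come from the same distribution.
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 4))
holdout = rng.normal(size=(200, 4))

# A "generator" that memorizes training rows versus one that samples fresh rows.
copied = train[:50].copy()
fresh = rng.normal(size=(50, 4))

share_copied = closer_to_train_share(copied, train, holdout)
share_fresh = closer_to_train_share(fresh, train, holdout)
print(share_copied, share_fresh)
```

In this sketch, a share close to 0.5 suggests that synthetic rows are no closer to the training set than to the holdout set, while a share close to 1 (as for the memorized rows, which have zero distance to their training originals) signals possible leakage of training records.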
