
A density ratio framework for evaluating the utility of synthetic data

Thom Benjamin Volker, Peter-Paul de Wolf, Erik-Jan van Kesteren

TL;DR

The paper tackles the challenge of evaluating synthetic data utility under privacy constraints and proposes a density ratio framework that directly compares observed and synthetic distributions through $r(\mathbf{x}) = p_{\text{obs}}(\mathbf{x}) / p_{\text{syn}}(\mathbf{x})$. Using nonparametric estimators with automatic hyperparameter tuning (via the densityratio R package), the framework yields both global divergence measures and local, pointwise diagnostics, enabling more nuanced utility assessment. Through univariate and multivariate simulations and a real CPS application, the authors show that density-ratio-based utilities can outperform traditional measures like $pMSE$ and KL divergence in ranking synthesis quality, while also providing actionable local information and potential for downstream reweighting. The work provides a practical, open-source workflow for synthetic data validation and model refinement, with broad applicability and clear avenues for future research into kernel choices, categorical handling, and privacy implications.

Abstract

Synthetic data generation is a promising technique to facilitate the use of sensitive data while mitigating the risk of privacy breaches. However, for synthetic data to be useful in downstream analysis tasks, it needs to be of sufficient quality. Various methods have been proposed to measure the utility of synthetic data, but their results are often incomplete or even misleading. In this paper, we propose using density ratio estimation to improve quality evaluation for synthetic data, and thereby the quality of synthesized datasets. We show how this framework relates to and builds on existing measures, yielding global and local utility measures that are informative and easy to interpret. We develop an estimator which requires little to no manual tuning due to automatic selection of a nonparametric density ratio model. Through simulations, we find that density ratio estimation yields more accurate estimates of global utility than established procedures. A real-world data application demonstrates how the density ratio can guide refinements of synthesis models and can be used to improve downstream analyses. We conclude that density ratio estimation is a valuable tool in synthetic data generation workflows and provide these methods in the accessible open source R-package densityratio.
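
To make the distinction between local and global utility concrete, the following is a minimal sketch, not the authors' implementation. It assumes two hypothetical normal densities (the `OBS` and `SYN` mean and variance pairs are illustrative choices, not values from the paper) and uses the common definition of the Pearson divergence, $PE = \tfrac{1}{2}\,E_{\text{syn}}[(r(\mathbf{x}) - 1)^2]$. The ratio evaluated at a point acts as a local diagnostic, and its average squared deviation from 1 under the synthetic distribution acts as a global summary.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    x = np.asarray(x, dtype=float)
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Hypothetical (mean, variance) pairs, chosen so the divergence is finite.
OBS = (0.5, 1.0)   # stands in for the observed-data density
SYN = (0.0, 1.5)   # stands in for the synthetic-data density

def density_ratio(x, obs=OBS, syn=SYN):
    """Local utility: r(x) = p_obs(x) / p_syn(x), known exactly in this toy case."""
    return normal_pdf(x, *obs) / normal_pdf(x, *syn)

def pearson_divergence_mc(obs=OBS, syn=SYN, n=200_000, seed=0):
    """Global utility: Monte Carlo estimate of 0.5 * E_syn[(r(x) - 1)^2]."""
    rng = np.random.default_rng(seed)
    x_syn = rng.normal(syn[0], np.sqrt(syn[1]), size=n)
    return float(0.5 * np.mean((density_ratio(x_syn, obs, syn) - 1.0) ** 2))

# r(x) > 1: region under-represented in the synthetic data; r(x) < 1: over-represented.
local_values = density_ratio(np.array([-3.0, 0.5, 3.0]))
global_value = pearson_divergence_mc()
```

In practice the two densities are unknown, so the ratio has to be estimated from samples; the sketch only shows how one function can feed both kinds of measure.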

Paper Structure (18 sections, 11 equations, 6 figures, 2 tables)


Figures (6)

  • Figure 1: Example of the true and estimated density ratio of two normal distributions with different means and variances (i.e., $p_{\text{syn}}(\mathbf{x}) = N(0,1)$ and $p_{\text{obs}}(\mathbf{x}) = N(1,2)$). The function $r(\mathbf{x}) = p_{\text{obs}}(\mathbf{x})/p_{\text{syn}}(\mathbf{x})$ denotes the true density ratio, the function $\hat{r}(\mathbf{x})$ denotes an estimate of the density ratio based on $n_{\text{syn}} = n_{\text{obs}} = 200$ samples from each distribution obtained with unconstrained Least-Squares Importance Fitting (uLSIF). Note that the density ratio is itself not a proper density.
  • Figure 2: True and synthetic data densities for the four simulations with Laplace, log-normal, location-scale $t$- and normal densities. All data-generating mechanisms have the same mean $\mu = 1$ and variance $\sigma^2 = 2$. Note that the true and synthetic data densities in the bottom-right panel overlap completely.
  • Figure 3: Estimated density ratios by unconstrained least-squares importance fitting in four univariate examples: a Laplace distribution, a log-normal distribution, a location-scale $t$-distribution and a normal distribution, all approximated by a normal distribution with the same mean and variance as the sample from the true distribution. Note that the mass of the synthetic data distribution in the tails (smaller than $-2$ or greater than $4$) is smaller than $0.034$.
  • Figure 4: Utility measures for $n = 1000$ and $D = 25$ for $PE$, $KL_{\sqrt{n}}$ and $pMSE_{\text{cart}}$ for $1000$ simulations of the three synthetic data models.
  • Figure 5: Real and synthetic data distributions for the variables household property taxes and social security benefits (social security) for the transformed and semi-continuous synthesis strategy. The panel titles display the estimated Pearson divergence for each variable between the observed and synthetic data, estimated using a density ratio model for each variable separately. Note that the y-axis is displayed on a square-root scale to enhance visibility.
  • ...and 1 more figure
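
The setting of Figure 1 can be imitated with a small from-scratch sketch of unconstrained least-squares importance fitting. This is not the densityratio package and not the authors' code. It uses the standard closed-form uLSIF solution with a Gaussian kernel, and several choices are assumptions made only for illustration: $N(1,2)$ is read as mean 1 and variance 2, the bandwidth `sigma` and ridge penalty `lam` are fixed by hand rather than tuned by cross-validation, and the names `ulsif_fit` and `ulsif_predict` are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    """Gaussian kernel matrix between 1-D points x and 1-D kernel centers."""
    diff = np.asarray(x, dtype=float)[:, None] - centers[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

def ulsif_fit(x_obs, x_syn, sigma=1.0, lam=0.1, n_centers=100, seed=0):
    """Closed-form uLSIF coefficients for r(x) = p_obs(x) / p_syn(x)."""
    rng = np.random.default_rng(seed)
    # Kernel centers: a random subset of the numerator (observed) sample.
    centers = rng.choice(x_obs, size=min(n_centers, len(x_obs)), replace=False)
    k_obs = gaussian_kernel(x_obs, centers, sigma)
    k_syn = gaussian_kernel(x_syn, centers, sigma)
    H = k_syn.T @ k_syn / len(x_syn)  # second moments of the basis under p_syn
    h = k_obs.mean(axis=0)            # first moments of the basis under p_obs
    # Ridge-regularised least squares: alpha = (H + lam * I)^{-1} h.
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    # Negative coefficients are rounded up to zero so the estimate stays non-negative.
    return centers, np.maximum(alpha, 0.0)

def ulsif_predict(x, centers, alpha, sigma=1.0):
    """Estimated density ratio at the points x (same sigma as used for fitting)."""
    return gaussian_kernel(x, centers, sigma) @ alpha

# Samples mimicking Figure 1: 200 draws from each distribution.
rng = np.random.default_rng(1)
x_obs = rng.normal(1.0, np.sqrt(2.0), size=200)  # assumed N(mean=1, variance=2)
x_syn = rng.normal(0.0, 1.0, size=200)           # N(0, 1)
centers, alpha = ulsif_fit(x_obs, x_syn)
r_hat = ulsif_predict(np.array([0.0, 3.0]), centers, alpha)
```

With hand-picked `sigma` and `lam`, the estimate should only be expected to track the qualitative shape of the true ratio, which is larger in the right tail where the observed density dominates. Selecting both hyperparameters automatically is the part that the paper delegates to its package.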