Defining 'Good': Evaluation Framework for Synthetic Smart Meter Data

Sheng Chai, Gus Chadney, Charlot Avery, Phil Grunewald, Pascal Van Hentenryck, Priya L. Donti

TL;DR

Standard privacy attacks such as reconstruction or membership inference attacks are shown to be inadequate for assessing the privacy risks of smart meter datasets. An improved method is proposed: inject implausible outliers into the training data, then launch privacy attacks directly on those outliers.

Abstract

Access to granular demand data is essential for the net zero transition; it allows for accurate profiling and active demand management as our reliance on variable renewable generation increases. However, public release of this data is often impossible due to privacy concerns. Good-quality synthetic data can circumvent this issue. Despite significant research on generating synthetic smart meter data, there is still insufficient work on creating a consistent evaluation framework. In this paper, we investigate how the evaluation frameworks common in other industries that use synthetic data, covering fidelity, utility and privacy, can be applied to synthetic smart meter data. We also recommend specific metrics to ensure that defining aspects of smart meter data are preserved, and we test the extent to which privacy can be protected using differential privacy. We show that standard privacy attack methods like reconstruction or membership inference attacks are inadequate for assessing privacy risks of smart meter datasets. We propose an improved method that injects implausible outliers into the training data, then launches privacy attacks directly on these outliers. The choice of $\epsilon$ (a metric of privacy loss) significantly impacts privacy risk, highlighting the necessity of performing these explicit privacy tests when making trade-offs between fidelity and privacy.
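The outlier-injection test described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the helper names `inject_outliers` and `reconstructed`, the flat high-load outlier recipe, the 48 half-hourly readings per profile, and the reading of the Figure 1 threshold radius as a fixed ratio of the outlier's vector norm. The two "generators" are stand-ins built from noise, not trained models.

```python
import numpy as np

def inject_outliers(train, n_outliers, scale, rng):
    """Append implausible profiles to the training set.

    Hypothetical recipe: a flat load at `scale` times the largest observed
    reading, plus small jitter so the outliers are distinct.
    """
    peak = train.max()
    outliers = np.full((n_outliers, train.shape[1]), scale * peak)
    outliers += rng.normal(0.0, 0.01 * peak, size=outliers.shape)
    return np.vstack([train, outliers]), outliers

def reconstructed(outlier, synthetic, ratio):
    """Return True if any synthetic sample lies inside the threshold radius.

    Assumption: radius = ratio * ||outlier||, one reading of the
    "ratio to the vector norm" criterion in Figure 1.
    """
    radius = ratio * np.linalg.norm(outlier)
    distances = np.linalg.norm(synthetic - outlier, axis=1)
    return bool((distances <= radius).any())

rng = np.random.default_rng(0)
train = rng.gamma(2.0, 0.1, size=(200, 48))  # toy daily profiles in kWh
augmented, outliers = inject_outliers(train, n_outliers=3, scale=10.0, rng=rng)

# Stand-in for a generator that memorises its training data (outliers included).
leaky_synthetic = augmented + rng.normal(0.0, 0.01, size=augmented.shape)
# Stand-in for a generator that only reproduces typical behaviour.
safe_synthetic = train + rng.normal(0.0, 0.01, size=train.shape)

leaky_hit = reconstructed(outliers[0], leaky_synthetic, ratio=0.1)
safe_hit = reconstructed(outliers[0], safe_synthetic, ratio=0.1)
print("leaky generator reconstructs outlier:", leaky_hit)
print("safe generator reconstructs outlier:", safe_hit)
```

Because the outliers are implausible by construction, a synthetic sample landing near one is hard to explain by chance, which is what makes the attack on them informative. The ratio of 0.1 is an arbitrary illustrative value.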

Paper Structure (46 sections, 1 equation, 15 figures)


Figures (15)

  • Figure 1: Defining the privacy risk in reconstruction attacks using the ratio to the vector norm. $x_1$, $x_2$ are synthetic data points, $x_{\text{outlier}}$ is an artificial outlier. The threshold radius is the distance boundary within which an outlier is considered to be reconstructed. In this example, $x_1$ falls within the threshold radius; thus the outlier is considered to be reconstructed. Data point $x_2$ does not fall within the threshold radius.
  • Figure 2: Left: Membership inference attack (MIA) using a discriminator with a False label for holdout samples and a True label for synthetic samples. Right: MIA using a GAN’s discriminator to distinguish holdout from synthetic samples.
  • Figure 3: Comparison of consumption quantiles in real smart meter data and synthetic smart meter data generated using CNZ's Faraday model [faraday_paper]. Quantile values are calculated at each half-hour window for the real and synthetic datasets. Plots show the half-hourly settlement period versus consumption (kWh).
  • Figure 4: Example of PCA and t-SNE plots of synthetic vs real datasets based on CNZ’s Faraday outputs.
  • Figure 5: Distribution of training (blue) and holdout (orange) sets of daily and weekly load profiles.
  • ...and 10 more figures
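
The per-window quantile comparison described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions rather than the paper's metric: the function names `half_hourly_quantiles` and `quantile_gap`, the chosen quantile levels, and the mean absolute gap used as a summary are all hypothetical, and the data is randomly generated.

```python
import numpy as np

# Assumed quantile levels; the caption does not say which ones are plotted.
QUANTILES = (0.05, 0.25, 0.5, 0.75, 0.95)

def half_hourly_quantiles(profiles, qs=QUANTILES):
    """Quantiles of consumption at each half-hour window.

    `profiles` has shape (n_days, 48); the result has shape (len(qs), 48).
    """
    return np.quantile(profiles, qs, axis=0)

def quantile_gap(real, synthetic, qs=QUANTILES):
    """Mean absolute difference per quantile level, averaged over the day (kWh)."""
    diff = half_hourly_quantiles(real, qs) - half_hourly_quantiles(synthetic, qs)
    return np.abs(diff).mean(axis=1)

rng = np.random.default_rng(1)
real = rng.gamma(2.0, 0.1, size=(500, 48))       # toy "real" profiles
synthetic = rng.gamma(2.0, 0.12, size=(500, 48))  # toy "synthetic" profiles

for q, gap in zip(QUANTILES, quantile_gap(real, synthetic)):
    print(f"quantile {q:.2f}: mean gap {gap:.3f} kWh")
```

Comparing quantiles window by window, rather than pooling all readings, keeps the time-of-day shape of the load profile visible in the comparison.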