Asymptotic utility of spectral anonymization

Katariina Perkonoja, Joni Virta

TL;DR

This paper analyzes the utility and privacy of spectral anonymization (SA) in an asymptotic setting, introducing two variants, $\mathcal{J}$-SA and $\mathcal{O}$-SA, in addition to the original $\mathcal{P}$-SA. It derives asymptotic results for mean and covariance estimation: $\mathcal{P}$-SA retains the mean-estimation efficiency of the original data, while $\mathcal{J}$-SA and $\mathcal{O}$-SA attain $1/2$ of that efficiency. All SA variants share an asymptotic covariance-estimation efficiency of $50\%$ relative to the original data, and the case $p=2$ is used to illustrate the inflated cross-term variances. A simulation study and a privacy assessment via distance-based record linkage show that no method dominates in finite samples; however, $\mathcal{O}$-SA provides the strongest privacy (no identical records) at the cost of higher computation, while $\mathcal{P}$-SA offers faster processing and strong mean-estimation efficiency. The results rely on eigenvalue distinctness and Gaussianity assumptions, and the authors discuss extensions to non-Gaussian data and comparisons with alternative anonymization approaches.

Abstract

In the contemporary data landscape characterized by multi-source data collection and third-party sharing, ensuring individual privacy stands as a critical concern. While various anonymization methods exist, their utility preservation and privacy guarantees remain challenging to quantify. In this work, we address this gap by studying the utility and privacy of the spectral anonymization (SA) algorithm, particularly in an asymptotic framework. Unlike conventional anonymization methods that directly modify the original data, SA operates by perturbing the data in a spectral basis and subsequently reverting them to their original basis. Alongside the original version $\mathcal{P}$-SA, employing random permutation transformation, we introduce two novel SA variants: $\mathcal{J}$-spectral anonymization and $\mathcal{O}$-spectral anonymization, which employ sign-change and orthogonal matrix transformations, respectively. We show how well, under some practical assumptions, these SA algorithms preserve the first and second moments of the original data. Our results reveal, in particular, that the asymptotic efficiency of all three SA algorithms in covariance estimation is exactly 50% when compared to the original data. To assess the applicability of these asymptotic results in practice, we conduct a simulation study with finite data and also evaluate the privacy protection offered by these algorithms using distance-based record linkage. Our research reveals that while no method exhibits clear superiority in finite-sample utility, $\mathcal{O}$-SA distinguishes itself for its exceptional privacy preservation, never producing identical records, albeit with increased computational complexity. Conversely, $\mathcal{P}$-SA emerges as a computationally efficient alternative, demonstrating unmatched efficiency in mean estimation.
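
The perturb-in-a-spectral-basis idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' reference implementation: the function name `spectral_anonymize`, the choice of the singular value decomposition of the centered data as the spectral basis, and the exact form of each perturbation (independent column permutations for $\mathcal{P}$-SA, entrywise random signs for $\mathcal{J}$-SA, and left-multiplication by a random $n \times n$ orthogonal matrix for $\mathcal{O}$-SA) are assumptions made for illustration only.

```python
import numpy as np


def spectral_anonymize(X, variant="P", seed=None):
    """Hedged sketch of spectral anonymization; not the authors' reference code."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mean = X.mean(axis=0)
    # Move to the spectral basis: centered data = U @ diag(s) @ Vt.
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)

    if variant == "P":
        # Assumption: permute each column of U independently.
        U_new = np.column_stack(
            [rng.permutation(U[:, j]) for j in range(U.shape[1])]
        )
    elif variant == "J":
        # Assumption: independent random signs applied entrywise to U.
        U_new = rng.choice([-1.0, 1.0], size=U.shape) * U
    elif variant == "O":
        # Assumption: left-multiply U by a random n x n orthogonal matrix,
        # drawn via QR of a Gaussian matrix with the usual sign correction.
        Q, R = np.linalg.qr(rng.standard_normal((n, n)))
        U_new = (Q * np.sign(np.diag(R))) @ U
    else:
        raise ValueError(f"unknown variant: {variant!r}")

    # Revert to the original basis and restore the mean.
    return (U_new * s) @ Vt + mean


# Hypothetical usage on simulated Gaussian data with distinct column scales.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3)) @ np.diag([3.0, 2.0, 1.0])
X_p = spectral_anonymize(X, "P", seed=1)
X_j = spectral_anonymize(X, "J", seed=1)
X_o = spectral_anonymize(X, "O", seed=1)
```

In this sketch, permuting a column of `U` leaves its column sum unchanged, so the permutation variant reproduces the sample mean exactly, whereas the sign-change and orthogonal variants do not. The orthogonal variant is also the only one that builds an $n \times n$ matrix, which is one plausible source of the higher computational cost noted in the abstract.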

Paper Structure

This paper contains 10 sections, 3 theorems, 20 equations, and 6 figures.

Key Result

Theorem

Under the eigenvalue assumption (assu:eigenvalues), we have the following, as $n \rightarrow \infty$.

Figures (6)

  • Figure 1: Illustration of the effects of $\mathcal{P}$-SA (blue), $\mathcal{J}$-SA (orange) and $\mathcal{O}$-SA (yellow) on the original data (green/darkest), with numerical labels corresponding to the row indices of the original data, in a scenario with $n = 10$ and $p = 2$. $\mathcal{P}$-SA and $\mathcal{J}$-SA occasionally produce unwanted overlap, wherein a row in the anonymized data matches a row in the original dataset. For $\mathcal{J}$-SA, this occurs when the random signs coincide with those of the original singular vectors, so that the observation is duplicated (overlap of same indices), whereas for $\mathcal{P}$-SA, random permutation of values may cause a different row to align with one in the original dataset (overlap of different indices). Conversely, $\mathcal{O}$-SA induces arbitrary rotations in the spectral space that prevent exact matches with probability 1.
  • Figure 2: The relative error of the empirical covariance matrices of the sample mean (solid line) and sample covariance (dotted line) compared to their asymptotic covariance matrices for $\mathcal{P}$-SA (blue), $\mathcal{J}$-SA (orange), and $\mathcal{O}$-SA (yellow), when the original data (green/darkest) are sampled from a normal distribution satisfying the eigenvalue assumption (assu:eigenvalues). Sample size is presented on a logarithmic scale.
  • Figure 3: The relative error of the empirical covariance matrices of the sample mean (solid line) and sample covariance (dotted line) compared to their asymptotic covariance matrices for $\mathcal{P}$-SA (blue), $\mathcal{J}$-SA (orange), and $\mathcal{O}$-SA (yellow), when the original data (green/darkest) are sampled from a normal distribution violating the eigenvalue assumption (assu:eigenvalues). Sample size is presented on a logarithmic scale.
  • Figure 4: $\text{EUC}_{\mathcal{A}}$ when the original data are sampled from a normal distribution satisfying the eigenvalue assumption (assu:eigenvalues). The y-axis shows the mean distance $\text{EUC}_{\mathcal{A}}$ between records in the anonymized data and any record in the original data across all datasets, while the x-axis shows the sample size on a logarithmic scale. The number of variables $p$ is distinguished by line type, and each anonymization approach is represented by a unique color.
  • Figure 5: Histograms of match proportions when the original data are sampled from a normal distribution satisfying the eigenvalue assumption (assu:eigenvalues). The y-axis represents the number of simulated datasets, while the x-axis indicates the proportion of matches relative to the sample size, with the different SA variants represented by distinct colors. Sample sizes $n > 400$ have been omitted, as they introduce no new information.
  • ...and 1 more figure
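
The distance-based linkage quantities plotted in Figures 4 and 5 can be sketched roughly as follows. The captions do not give the exact definitions, so this is an illustrative reading rather than the paper's metric: the function name `linkage_summary` is hypothetical, the "mean distance" is assumed to be the average Euclidean distance from each anonymized record to its nearest original record, and a "match" is assumed to mean that the nearest original record has the same row index as the anonymized record.

```python
import numpy as np


def linkage_summary(X_orig, X_anon):
    """Hedged sketch of distance-based record linkage summaries.

    Assumes both arrays have the same number of rows and that row i of
    X_anon was derived from row i of X_orig.
    """
    X_orig = np.asarray(X_orig, dtype=float)
    X_anon = np.asarray(X_anon, dtype=float)
    # dist[i, j] = Euclidean distance between anonymized row i and original row j.
    diff = X_anon[:, None, :] - X_orig[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Assumed distance summary: mean nearest-neighbour distance.
    mean_nearest = dist.min(axis=1).mean()
    # Assumed match rule: the nearest original record shares the row index.
    nearest = dist.argmin(axis=1)
    match_proportion = np.mean(nearest == np.arange(X_anon.shape[0]))
    return float(mean_nearest), float(match_proportion)


# Hypothetical sanity check: unmodified data links every record to itself.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
demo_distance, demo_matches = linkage_summary(X_demo, X_demo)
```

The pairwise distance array has $n^2 p$ entries, so for large sample sizes a chunked or tree-based nearest-neighbour search would be a more practical design than this dense computation.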

Theorems & Definitions (5)

  • Theorem
  • Theorem
  • Proof of Theorem (theo:means)
  • Lemma
  • Proof of Theorem (theo:covariances)