Distribution-Agnostic Database De-Anonymization Under Obfuscation And Synchronization Errors

Serhat Bakirtas, Elza Erkip

TL;DR

This work presents a distribution-agnostic framework for database de-anonymization under synchronization errors and obfuscation, showing that a seed size of $\Lambda_n=\omega(\log n)$ suffices to achieve the matching capacity $C=I(X;Y^S|S)$, which equals the distribution-aware benchmark. The authors introduce a distribution-agnostic noisy replica detector and a seeded deletion detector to infer the column-repetition pattern, followed by a joint-typicality-based de-anonymization scheme that estimates the underlying distributions from the seeds. They prove achievability and show that not knowing the distributions causes no loss in the asymptotic regime, and they provide non-asymptotic simulations demonstrating practical performance for finite databases and varying obfuscation levels. In the no-obfuscation setting, they show that repetition detection and exact sequence matching suffice, with capacity $C=(1-\delta)H(X)$, underscoring the robustness of their approach and its privacy implications in practical deployments.
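As a quick numeric illustration of the no-obfuscation capacity $C=(1-\delta)H(X)$, the sketch below evaluates it for a made-up distribution. It assumes that $\delta$ denotes the column deletion probability (the summary does not define it) and measures entropy in bits; the function names and the example values are hypothetical.

```python
import math


def entropy_bits(pmf):
    """Shannon entropy, in bits, of a probability mass function."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def no_obfuscation_capacity(pmf_x, delta):
    """Evaluate C = (1 - delta) * H(X).

    Assumption: delta is the probability that a column is deleted.
    """
    return (1.0 - delta) * entropy_bits(pmf_x)


# Illustrative values only: X uniform over 4 symbols, 20% of columns deleted.
capacity = no_obfuscation_capacity([0.25, 0.25, 0.25, 0.25], delta=0.2)
print(f"H(X) = {entropy_bits([0.25] * 4):.3f} bits, C = {capacity:.3f} bits")
```

Under these assumed values the entropy is 2 bits per entry, so deleting a fifth of the columns scales the capacity down to roughly 1.6 bits.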

Abstract

Database de-anonymization typically involves matching an anonymized database with correlated publicly available data. Existing research focuses either on practical aspects without requiring knowledge of the data distribution yet provides limited guarantees, or on theoretical aspects assuming known distributions. This paper aims to bridge these two approaches, offering theoretical guarantees for database de-anonymization under synchronization errors and obfuscation without prior knowledge of data distribution. Using a modified replica detection algorithm and a new seeded deletion detection algorithm, we establish sufficient conditions on the database growth rate for successful matching, demonstrating a double-logarithmic seed size relative to row size is sufficient for detecting deletions in the database. Importantly, our findings indicate that these sufficient de-anonymization conditions are tight and are the same as in the distribution-aware setting, avoiding asymptotic performance loss due to unknown distributions. Finally, we evaluate the performance of our proposed algorithms through simulations, confirming their effectiveness in more practical, non-asymptotic, scenarios.

Paper Structure (17 sections, 7 theorems, 72 equations, 7 figures, 6 algorithms)

Key Result

Proposition 1

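(Joint AEP \cite{cover2006elements}) Let $\Tilde{X}^n$ and $\Tilde{Y}^n$ be generated independently according to the i.i.d. marginal distributions $p_{X^n}$ and $p_{Y^n}$. Then, the following holds:

$$\Pr\left((\Tilde{X}^n,\Tilde{Y}^n)\in A_\epsilon^{(n)}\right)\le 2^{-n(I(X;Y)-3\epsilon)},$$

where $A_\epsilon^{(n)}$ denotes the set of jointly $\epsilon$-typical sequences and $I(X;Y)\triangleq H(X)+H(Y)-H(X,Y)$ is the mutual information.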
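To make the definition $I(X;Y)\triangleq H(X)+H(Y)-H(X,Y)$ concrete, here is a minimal sketch that computes it from a joint probability table and evaluates the standard joint AEP upper bound $2^{-n(I(X;Y)-3\epsilon)}$ from Cover and Thomas on the probability that two independently drawn sequences look jointly typical. The binary symmetric example, the function names, and the values of `n` and `eps` are illustrative assumptions, not taken from the paper.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information_bits(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), where joint[x][y] = p(x, y)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    p_xy = [p for row in joint for p in row]
    return entropy_bits(p_x) + entropy_bits(p_y) - entropy_bits(p_xy)


def joint_aep_bound(joint, n, eps):
    """Upper bound 2^{-n (I(X;Y) - 3 eps)} from the joint AEP."""
    return 2.0 ** (-n * (mutual_information_bits(joint) - 3.0 * eps))


# Hypothetical example: uniform binary X observed through a binary
# symmetric channel with crossover probability 0.1.
crossover = 0.1
joint = [
    [0.5 * (1 - crossover), 0.5 * crossover],
    [0.5 * crossover, 0.5 * (1 - crossover)],
]
mi = mutual_information_bits(joint)
bound = joint_aep_bound(joint, n=100, eps=0.01)
print(f"I(X;Y) = {mi:.4f} bits, bound for n=100: {bound:.3e}")
```

The bound decays exponentially in $n$ whenever $I(X;Y)>3\epsilon$, which is why longer rows make a false match between unrelated rows increasingly unlikely.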

Figures (7)

  • Figure 1: An illustrative example of database matching under column repetitions. The column colored in red is deleted, whereas the column colored in blue is replicated. $Y_{i,2}^{(1)}$ and $Y_{i,2}^{(2)}$ denote noisy copies/replicas of $X_{i,2}$. The goal of database de-anonymization studied in this paper is to estimate the correct row permutation $\sigma_n=\begin{pmatrix}1&2&3&4&5&6\\2&6&4&1&3&5\end{pmatrix}$, by matching the rows of $\mathbf{X}$ and $\mathbf{Y}$ without any prior information on the underlying database ($p_{X}$), obfuscation ($p_{Y|X}$), and repetition ($p_S$) distributions.
  • Figure 2: Relation between the anonymized database $\mathbf{X}$ and the labeled correlated database, $\mathbf{Y}$.
  • Figure 3: Hamming distances between the columns of $\mathbf{G}^{(1)}$ and $\mathbf{G}^{(2)}$ with $n=10$, $\Tilde{K}_n=7$ and $\Lambda_n=10^4$ for $q_0\approx 0.76$ and $q_1\approx 0.92$. The $(i,j)$th element corresponds to $L_{i,j}$, with the color bar indicating the approximate values. It can be seen that there are no outliers in the $4$th, $6$th, and $10$th rows. Hence, it can be inferred that $I_{\text{del}}=(4,6,10)$.
  • Figure 4: Probability of error $\kappa^{(1)}_n$ of the noisy replica detection algorithm (Algorithm \ref{alg:noisyreplicadetection}) vs. the row size $m_n$ with $10^5$ trials. The y-axis is given in logarithmic scale to validate the exponential relation between the error probability and $m_n$ given in \eqref{eq:replicadetectionlast}. Different curves correspond to different crossover probabilities $\epsilon$.
  • Figure 5: Probability of error of the modified seeded deletion detection algorithm (Algorithm \ref{alg:modifieddeletiondetection}) vs. the seed size $\Lambda_n$ with $\Tilde{\tau}=1.5$ and $10^4$ trials. The y-axis is given in logarithmic scale to validate the exponential relation between the deletion detection error probability and $\Lambda_n$ given by Lemma \ref{lem: deletion detection}. Different curves correspond to different crossover probabilities $\epsilon$.
  • ...and 2 more figures
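
The idea behind the Figure 3 caption, that a row of the Hamming-distance matrix $L_{i,j}$ with no outlier indicates a deleted column, can be sketched as follows. This is a simplified illustration and not the paper's algorithm: the outlier rule (distance at least `tau` standard deviations from the mean of all pairwise distances), the synthetic data generator, and every parameter value are assumptions. Here `x_cols` and `y_cols` play the roles of the columns of $\mathbf{G}^{(1)}$ and $\mathbf{G}^{(2)}$.

```python
import random
import statistics


def hamming_fraction(col_a, col_b):
    """Fraction of positions at which two equal-length columns differ."""
    return sum(a != b for a, b in zip(col_a, col_b)) / len(col_a)


def detect_deleted_columns(x_cols, y_cols, tau):
    """Return 1-based indices of columns in x_cols whose distance row has no outlier.

    Assumed rule: a distance is an outlier if it lies at least tau standard
    deviations away from the mean of all pairwise distances.
    """
    dist = [[hamming_fraction(cx, cy) for cy in y_cols] for cx in x_cols]
    flat = [d for row in dist for d in row]
    mean, std = statistics.mean(flat), statistics.pstdev(flat)
    deleted = []
    for i, row in enumerate(dist, start=1):
        if not any(abs(d - mean) >= tau * std for d in row):
            deleted.append(i)
    return deleted


def make_seed_columns(num_seeds, num_cols, deleted, flip_prob, alphabet, rng):
    """Synthetic seeds: i.i.d. uniform columns; retained columns get noisy copies."""
    x_cols = [[rng.choice(alphabet) for _ in range(num_seeds)]
              for _ in range(num_cols)]
    y_cols = []
    for j, col in enumerate(x_cols, start=1):
        if j in deleted:
            continue  # deleted columns leave no trace in the second database
        noisy = [
            rng.choice([a for a in alphabet if a != s])
            if rng.random() < flip_prob else s
            for s in col
        ]
        y_cols.append(noisy)
    return x_cols, y_cols


rng = random.Random(0)
x_cols, y_cols = make_seed_columns(
    num_seeds=2000, num_cols=10, deleted={4, 6, 10},
    flip_prob=0.2, alphabet=[0, 1, 2, 3], rng=rng,
)
estimated = detect_deleted_columns(x_cols, y_cols, tau=2.0)
print("estimated deleted columns:", estimated)
```

With these assumed settings, a column paired with its own noisy copy differs in about 20% of the seed rows, while unrelated columns differ in about 75%, so the retained columns stand out clearly and only the rows for the deleted columns lack an outlier. The threshold is computed from the data rather than from known distributions, in keeping with the distribution-agnostic setting.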

Theorems & Definitions (23)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Definition 8
  • Remark 1
  • Definition 9
  • ...and 13 more