CVPL: A Geometric Framework for Post-Hoc Linkage Risk Assessment in Protected Tabular Data

Valery Khvatov, Alexey Neyman

TL;DR

CVPL introduces a geometric, post-hoc framework for assessing residual linkage risk between original and protected tabular data. By modeling linkage as a sequence of blocking, vectorization, latent projection, and similarity evaluation, CVPL provides continuous risk surfaces R(λ, τ) that capture how protection strength and attacker strictness jointly affect the feasibility of plausible links. The framework is paired with a monotonicity theorem for blocking relaxations, enabling anytime risk estimation with valid lower bounds, and is demonstrated on a 10,000-record simulation across 19 protection configurations. Empirical results show that formal k-anonymity can coexist with substantial empirical linkability, that Fellegi–Sunter linkage can over-link under representation shifts, and that behavioral fingerprints, rather than demographics, dominate linkage risk. CVPL thus offers interpretable diagnostics for safety evaluation, mechanism comparison, and utility–risk trade-off analysis, while remaining a complement to, not a replacement for, formal privacy guarantees.

Abstract

Formal privacy metrics provide compliance-oriented guarantees but often fail to quantify actual linkability in released datasets. We introduce CVPL (Cluster-Vector-Projection Linkage), a geometric framework for post-hoc assessment of linkage risk between original and protected tabular data. CVPL represents linkage analysis as an operator pipeline comprising blocking, vectorization, latent projection, and similarity evaluation, yielding continuous, scenario-dependent risk estimates rather than binary compliance verdicts. We formally define CVPL under an explicit threat model and introduce threshold-aware risk surfaces, R(λ, τ), that capture the joint effects of protection strength and attacker strictness. We establish a progressive blocking strategy with monotonicity guarantees, enabling anytime risk estimation with valid lower bounds. We demonstrate that the classical Fellegi–Sunter linkage emerges as a special case of CVPL under restrictive assumptions, and that violations of these assumptions can lead to systematic over-linking bias. Empirical validation on 10,000 records across 19 protection configurations demonstrates that formal k-anonymity compliance may coexist with substantial empirical linkability, with a significant portion arising from non-quasi-identifier behavioral patterns. CVPL provides interpretable diagnostics identifying which features drive linkage feasibility, supporting privacy impact assessment, protection mechanism comparison, and utility–risk trade-off analysis.
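
To make the operator pipeline concrete, the following is a minimal sketch of how blocking, vectorization, latent projection, and similarity evaluation could be chained into a toy risk surface. It is not the authors' implementation. Everything here is an assumption made for illustration: the function names, z-score vectorization, an SVD-based projection, cosine similarity, additive Gaussian noise as the protection mechanism (with λ read as the noise scale), τ read as the similarity threshold, and risk read as the fraction of original records with at least one same-block protected record at similarity of at least τ.

```python
import numpy as np

def project(z_orig, z_prot, dim):
    """Latent projection (assumed): top-`dim` right singular vectors of the original data."""
    _, _, vt = np.linalg.svd(z_orig, full_matrices=False)
    basis = vt[:dim].T
    return z_orig @ basis, z_prot @ basis

def cosine_similarity(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    return a @ b.T

def linkage_risk(orig, prot, block_orig, block_prot, tau, dim=2):
    """Fraction of original records with a same-block protected record at similarity >= tau."""
    # Vectorization (assumed): z-score both tables with the original table's statistics.
    mean, std = orig.mean(axis=0), orig.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    p_orig, p_prot = project((orig - mean) / std, (prot - mean) / std, dim)
    sim = cosine_similarity(p_orig, p_prot)
    # Blocking: only pairs that share a block key are candidate links.
    same_block = block_orig[:, None] == block_prot[None, :]
    sim = np.where(same_block, sim, -np.inf)
    return float((sim.max(axis=1) >= tau).mean())

def risk_surface(orig, blocks, lambdas, taus, seed=0):
    """Toy surface R(lambda, tau); lambda is the scale of additive Gaussian noise (assumed)."""
    rng = np.random.default_rng(seed)
    surface = np.zeros((len(lambdas), len(taus)))
    for i, lam in enumerate(lambdas):
        prot = orig + rng.normal(scale=lam, size=orig.shape)  # hypothetical protection
        for j, tau in enumerate(taus):
            surface[i, j] = linkage_risk(orig, prot, blocks, blocks, tau)
    return surface

# Synthetic data only; these are not the paper's records or configurations.
rng = np.random.default_rng(42)
orig = rng.normal(size=(200, 5))
blocks = rng.integers(0, 4, size=200)
surface = risk_surface(orig, blocks, lambdas=[0.0, 0.5, 2.0], taus=[0.5, 0.8, 0.95, 0.99])
print(surface)
```

Each row of `surface` corresponds to one protection strength and each column to one threshold. Within a row the risk cannot increase as τ grows, because a stricter threshold can only remove candidate links.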

Paper Structure

This paper contains 274 sections, 4 theorems, 54 equations, 6 figures, 29 tables, 2 algorithms.

Key Result

Proposition 4.1

Fellegi–Sunter (FS) probabilistic record linkage (Fellegi and Sunter, 1969) can be expressed as a special case of CVPL under the following restrictive assumptions:

Figures (6)

  • Figure 1: CVPL operator pipeline: blocking restricts candidates, vectorization and projection create embeddings, similarity scoring identifies potential links.
  • Figure 2: Existential linkage risk (CVPL-LR) versus identification risk ($1/k$) under k-anonymity protection.
  • Figure 3: Risk surface showing CVPL-LR as a function of k-anonymity parameter and similarity threshold.
  • Figure 4: Distribution of similarity scores for true matches ($S^{+}$) and false matches ($S^{-}$).
  • Figure 5: Comparison of CVPL and Fellegi–Sunter linkage rates and precision.
  • ...and 1 more figure

Theorems & Definitions (7)

  • Proposition 4.1: Fellegi–Sunter as a Special Case of CVPL
  • Proof (sketch)
  • Proposition 4.2: Systematic Bias of Fellegi–Sunter under Representation Shift
  • Definition 4.3: Blocking Relaxation
  • Theorem 4.4: Monotonicity under Relaxation
  • Proof
  • Corollary 4.5: Anytime Lower Bound
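
The statements of Definition 4.3, Theorem 4.4, and Corollary 4.5 are not reproduced in this listing, so the following is only a sketch of the general idea under stated assumptions. It assumes that a blocking relaxation drops blocking keys one at a time, so that each level's candidate set contains the previous one, and that risk is existential: the fraction of original records with at least one candidate at similarity of at least τ. Under those assumptions the risk can only grow as blocking is relaxed, so stopping at any level yields a lower bound on the fully relaxed risk. The function names and the synthetic similarity matrix are hypothetical.

```python
import numpy as np

def existential_risk(sim, candidate_mask, tau):
    """Fraction of rows with at least one candidate pair at similarity >= tau."""
    hits = (sim >= tau) & candidate_mask
    return float(hits.any(axis=1).mean())

def progressive_risk(sim, keys_orig, keys_prot, tau):
    """Risk at each relaxation level, from all blocking keys down to none.

    keys_orig and keys_prot are integer arrays of shape (n_records, n_keys).
    Level 0 requires agreement on every key; each later level drops the last key,
    so the candidate mask at each level is a superset of the previous one.
    """
    n_keys = keys_orig.shape[1]
    risks = []
    for used in range(n_keys, -1, -1):
        mask = np.ones(sim.shape, dtype=bool)
        for c in range(used):
            mask &= keys_orig[:, c][:, None] == keys_prot[:, c][None, :]
        risks.append(existential_risk(sim, mask, tau))
    return risks

# Synthetic similarities and keys only; not derived from the paper's experiments.
rng = np.random.default_rng(7)
sim = rng.uniform(size=(50, 60))
keys_orig = rng.integers(0, 3, size=(50, 2))
keys_prot = rng.integers(0, 3, size=(60, 2))
risks = progressive_risk(sim, keys_orig, keys_prot, tau=0.95)
print(risks)
```

The list `risks` is non-decreasing by construction, and its last entry equals the risk with no blocking at all. An evaluation that is interrupted after an early level can therefore report that level's value as a conservative estimate.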