CVPL: A Geometric Framework for Post-Hoc Linkage Risk Assessment in Protected Tabular Data
Valery Khvatov, Alexey Neyman
TL;DR
CVPL introduces a geometric, post-hoc framework for assessing residual linkage risk between original and protected tabular data. By modeling linkage as a sequence of blocking, vectorization, latent projection, and similarity evaluation, CVPL provides continuous risk surfaces R(λ, τ) that capture how protection strength and attacker strictness jointly affect the feasibility of plausible links. The framework is paired with a monotonicity theorem for blocking relaxations, enabling anytime risk estimation with valid lower bounds, and is demonstrated on a 10,000-record simulation across 19 protection configurations. Empirical results show that formal k-anonymity can coexist with substantial empirical linkability, that Fellegi–Sunter linkage can over-link under representation shifts, and that behavioral fingerprints, rather than demographics, dominate linkage risk. CVPL thus offers interpretable diagnostics for safety evaluation, mechanism comparison, and utility–risk trade-off analysis, while remaining a complement to, not a replacement for, formal privacy guarantees.
Abstract
Formal privacy metrics provide compliance-oriented guarantees but often fail to quantify actual linkability in released datasets. We introduce CVPL (Cluster-Vector-Projection Linkage), a geometric framework for post-hoc assessment of linkage risk between original and protected tabular data. CVPL represents linkage analysis as an operator pipeline comprising blocking, vectorization, latent projection, and similarity evaluation, yielding continuous, scenario-dependent risk estimates rather than binary compliance verdicts. We formally define CVPL under an explicit threat model and introduce threshold-aware risk surfaces, R(λ, τ), that capture the joint effects of protection strength and attacker strictness. We establish a progressive blocking strategy with monotonicity guarantees, enabling anytime risk estimation with valid lower bounds. We demonstrate that classical Fellegi–Sunter linkage emerges as a special case of CVPL under restrictive assumptions, and that violations of these assumptions can lead to systematic over-linking bias. Empirical validation on 10,000 records across 19 protection configurations demonstrates that formal k-anonymity compliance may coexist with substantial empirical linkability, with a significant portion arising from non-quasi-identifier behavioral patterns. CVPL provides interpretable diagnostics identifying which features drive linkage feasibility, supporting privacy impact assessment, protection mechanism comparison, and utility–risk trade-off analysis.
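To make the pipeline described above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that risk is the fraction of original records with at least one protected record in the same block whose cosine similarity in the latent space reaches τ; the paper's formal definition of R(λ, τ) may differ. The column names, the toy Gaussian-noise protection parameterized by λ, the random projection matrix `W`, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

FEATS = ["visits", "spend", "night_share"]       # hypothetical behavioral features
W = np.random.default_rng(1).normal(size=(3, 2)) # stand-in for a latent projection

def vectorize(records, feat_cols):
    """Vectorization: numeric feature columns -> one row vector per record."""
    return np.array([[float(r[c]) for c in feat_cols] for r in records])

def project(X, W):
    """Latent projection followed by L2 normalization of each row."""
    Z = X @ W
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.where(norms == 0, 1.0, norms)

def same_block(orig, prot, block_cols):
    """Blocking: pair (i, j) is a candidate iff all block columns agree."""
    ko = [tuple(r[c] for c in block_cols) for r in orig]
    kp = [tuple(r[c] for c in block_cols) for r in prot]
    return np.array([[a == b for b in kp] for a in ko], dtype=bool)

def risk(orig, prot, block_cols, feat_cols, W, tau):
    """Fraction of original records with >= 1 candidate at cosine sim >= tau."""
    zo = project(vectorize(orig, feat_cols), W)
    zp = project(vectorize(prot, feat_cols), W)
    sim = zo @ zp.T                               # similarity evaluation
    hits = (sim >= tau) & same_block(orig, prot, block_cols)
    return float(hits.any(axis=1).mean())

def protect(records, lam, seed=0):
    """Toy protection (assumption): Gaussian noise of scale lam on FEATS."""
    rng = np.random.default_rng(seed)
    out = []
    for r in records:
        p = dict(r)
        for c in FEATS:
            p[c] = r[c] + lam * rng.normal()
        out.append(p)
    return out

orig = [
    {"region": "N", "age_band": "30-39", "visits": 12, "spend": 3.5, "night_share": 0.1},
    {"region": "N", "age_band": "30-39", "visits": 2,  "spend": 9.0, "night_share": 0.7},
    {"region": "N", "age_band": "40-49", "visits": 7,  "spend": 1.2, "night_share": 0.4},
    {"region": "S", "age_band": "30-39", "visits": 1,  "spend": 0.5, "night_share": 0.9},
    {"region": "S", "age_band": "40-49", "visits": 15, "spend": 6.0, "night_share": 0.2},
    {"region": "S", "age_band": "40-49", "visits": 4,  "spend": 4.4, "night_share": 0.5},
]

LAMS, TAUS = (0.0, 0.5, 2.0), (0.8, 0.95, 0.99)

# Risk surface on a small (lambda, tau) grid, blocking on region only.
surface = {
    (lam, tau): risk(orig, protect(orig, lam), ["region"], FEATS, W, tau)
    for lam in LAMS for tau in TAUS
}

# Progressive blocking: each relaxation can only add candidate pairs, so the
# estimate from a stricter level is a lower bound on every looser level.
levels = [["region", "age_band"], ["region"], []]
prot = protect(orig, lam=0.5)
bounds = [risk(orig, prot, cols, FEATS, W, tau=0.95) for cols in levels]

for key in sorted(surface):
    print(key, surface[key])
print("progressive lower bounds:", bounds)
```

Under this existence-based definition, the monotonicity in blocking follows directly from set inclusion: dropping a block column enlarges the candidate set, so a record that was linkable stays linkable. That is what allows an analysis to stop early at a strict blocking level and still report a valid lower bound. For the same reason, the sketch's risk can only decrease as τ increases for a fixed protected dataset.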
