Embedding based Encoding Scheme for Privacy Preserving Record Linkage
Sirintra Vaiwsri, Thilina Ranbaduge
TL;DR
The work tackles privacy-preserving record linkage (PPRL) by introducing the Embedding based Encoding Scheme (EmbBin), which encodes q-grams via continuous bag-of-words (CBOW) embeddings into binary strings for secure cross-database matching. Using a three-party protocol with two database owners (DOs) and a linkage unit (LU), EmbBin achieves high linkage quality while constraining information leakage, particularly for short records. Empirical results on the NCVR and DBLP datasets show that EmbBin often outperforms TabHash and 2SH in both linkage quality and privacy, and generally surpasses Bloom filters in privacy, albeit with some trade-offs for very long records. The study presents EmbBin as a practical, flexible approach that balances utility and privacy, and points to alternative embedding techniques as a direction for future improvement.
Abstract
To discover new insights from data, there is a growing need to share information that is often held by different organisations. One key task in data integration is the calculation of similarities between records in different databases to identify pairs or sets of records that correspond to the same real-world entities. Due to privacy and confidentiality concerns, however, the owners of sensitive databases are often not allowed or not willing to exchange or share their data with other organisations to allow such similarity calculations. Privacy-preserving record linkage (PPRL) is the process of matching records that refer to the same entity across sensitive databases held by different organisations while ensuring that no information about the entities is revealed to the participating parties. In this paper, we study how embedding-based encoding techniques can be applied in the PPRL context to ensure the privacy of the entities that are being linked. We first convert individual q-grams into the embedded space and then convert the embeddings of the set of q-grams of a given record into a binary representation. The final binary representations can be used to classify record pairs into matches and non-matches. We empirically evaluate our proposed encoding technique on different real-world datasets. The results suggest that, for short record values, our proposed encoding approach can provide better linkage accuracy and protect the privacy of individuals against attacks, compared to state-of-the-art techniques.
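The two encoding steps described in the abstract (q-grams to embeddings, then embeddings to a binary representation) can be illustrated with a minimal sketch. This is not the authors' implementation, and several parts are assumptions made only for illustration: `toy_embedding` is a hypothetical stand-in that derives a deterministic pseudo-random vector from each q-gram instead of using a trained CBOW model, the per-record aggregation (a plain mean of the q-gram vectors) and the binarisation rule (one bit per dimension, set when the mean is positive) are guesses at a simple scheme, and `DIM`, `encode` and `hamming_similarity` are illustrative names that do not come from the paper.

```python
import hashlib
import random

DIM = 64  # hypothetical embedding dimension, also the bit-string length


def qgrams(value, q=2):
    """Return the set of overlapping q-grams of a lower-cased string."""
    value = value.lower()
    return {value[i:i + q] for i in range(len(value) - q + 1)}


def toy_embedding(gram, dim=DIM):
    """Assumed stand-in for a trained CBOW vector.

    Produces a deterministic pseudo-random vector seeded by the q-gram,
    so the same q-gram always maps to the same vector.
    """
    digest = hashlib.sha256(gram.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def encode(value, q=2, dim=DIM):
    """Encode a record value as a list of bits (assumed scheme)."""
    grams = qgrams(value, q)
    if not grams:
        return [0] * dim  # value shorter than q: no q-grams to embed
    vectors = [toy_embedding(g, dim) for g in sorted(grams)]
    # Assumption: aggregate the q-gram vectors by their per-dimension mean.
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    # Assumption: binarise by the sign of each dimension.
    return [1 if x > 0 else 0 for x in mean]


def hamming_similarity(a, b):
    """Fraction of bit positions on which two encodings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    for left, right in [("smith", "smith"), ("smith", "smyth"), ("smith", "jones")]:
        sim = hamming_similarity(encode(left), encode(right))
        print(left, right, round(sim, 3))
```

In a three-party setting, each database owner would compute such encodings locally and send only the bit strings to the linkage unit, which compares them and classifies pairs above a chosen similarity threshold as matches. The threshold and the similarity measure are left open here because the abstract does not specify them.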
