Embedding based Encoding Scheme for Privacy Preserving Record Linkage

Sirintra Vaiwsri, Thilina Ranbaduge

TL;DR

This work addresses privacy-preserving record linkage (PPRL) by introducing an embedding based encoding scheme, EmbBin, which maps q-grams to continuous bag-of-words (CBOW) embeddings and then to binary strings that can be compared across databases. Using a three-party protocol with two database owners (DOs) and a linkage unit (LU), EmbBin achieves high linkage quality while limiting information leakage, particularly for short records. On the NCVR and DBLP datasets, EmbBin often outperforms TabHash and 2SH in both linkage quality and privacy, and generally surpasses Bloom filters in privacy, with some trade-offs for very long records. The study presents EmbBin as a practical, flexible approach that balances utility and privacy, and points to alternative embedding techniques as future work.

Abstract

To discover new insights from data, there is a growing need to share information that is often held by different organisations. One key task in data integration is the calculation of similarities between records in different databases to identify pairs or sets of records that correspond to the same real-world entities. Due to privacy and confidentiality concerns, however, the owners of sensitive databases are often not allowed or willing to exchange or share their data with other organisations to allow such similarity calculations. Privacy-preserving record linkage (PPRL) is the process of matching records that refer to the same entity across sensitive databases held by different organisations while ensuring no information about the entities is revealed to the participating parties. In this paper, we study how embedding based encoding techniques can be applied in the PPRL context to ensure the privacy of the entities that are being linked. We first convert individual q-grams into the embedded space and then convert the embedding of a set of q-grams of a given record into a binary representation. The final binary representations can be used to link records into matches and non-matches. We empirically evaluate our proposed encoding technique against different real-world datasets. The results suggest that our proposed encoding approach can provide better linkage accuracy and protect the privacy of individuals against attack compared to state-of-the-art techniques for short record values.

Paper Structure

This paper contains 19 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Calculation of the Dice coefficient similarity [Chr12] between the names "peter" and "pete", converted into bigrams ($q=2$) and encoded into two Bloom filters $\mathbf{b}_1$ and $\mathbf{b}_2$ of length $l=12$ bits using $k=2$ hash functions. The $1$ bit shown in italics at position $6$ results from a hash collision, because both "pe" and "te" are hashed to this position.
  • Figure 2: Overview of the protocol of our approach. The rounded blue boxes are the DOs' databases and the LU. The data preparation step is shown in yellow, and the encoding steps are shown in orange. The binary strings (encoded values) are sent to the LU for the comparison step, which is shown in purple under the LU.
  • Figure 3: Example of the encoding and comparison processes. The embeddings of all possible q-grams are generated first, and each q-gram is then encoded into a binary string. The embeddings and binary strings of all possible q-grams are shown in the orange boxes. The pink boxes show the encoding of the string "peter" from the first database and of the string "pete" from the second database. Each pink box shows the matrix $\mathbf{M}_\mathbf{P}$, the matrix $\mathbf{T}$, and the final binary string $b$ of the string, where each binary string was generated using $k = 5$. The final binary strings of the two databases are compared using the Dice similarity, as shown in the purple box.
  • Figure 4: Runtime required by a DO for data preparation and encoding, for the different approaches on the different datasets.
  • Figure 5: Runtime required by the LU for comparing encoded records, for the different approaches on the different datasets.
  • ...and 1 more figure
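
The Bloom filter comparison described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the helper names (`qgrams`, `bloom_encode`, `dice`) are hypothetical, and the $k$ hash functions are assumed here to be seeded SHA-256 digests, so the bit positions will not match those drawn in the figure. The Dice coefficient of two bit vectors is computed as $2c / (x_1 + x_2)$, where $c$ is the number of positions set to $1$ in both filters and $x_1$, $x_2$ are the numbers of $1$ bits in each filter.

```python
import hashlib


def qgrams(value, q=2):
    """Return the set of character q-grams of a string (bigrams for q=2)."""
    return {value[i:i + q] for i in range(len(value) - q + 1)}


def bloom_encode(value, l=12, k=2, q=2):
    """Encode the q-grams of a string into a Bloom filter of l bits.

    Assumption: the k hash functions are simulated by prefixing each
    q-gram with a seed and hashing with SHA-256; the source does not
    specify which hash functions are used.
    """
    bits = [0] * l
    for gram in qgrams(value, q):
        for seed in range(k):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).hexdigest()
            bits[int(digest, 16) % l] = 1  # collisions simply keep the bit at 1
    return bits


def dice(b1, b2):
    """Dice coefficient of two equal-length bit vectors."""
    common = sum(x & y for x, y in zip(b1, b2))
    total = sum(b1) + sum(b2)
    return 2.0 * common / total if total else 0.0


b1 = bloom_encode("peter")
b2 = bloom_encode("pete")
print(b1, b2, dice(b1, b2))
```

Because the bigrams of "pete" are a subset of those of "peter", every bit set in `b2` is also set in `b1` under this sketch, which is why similar names yield a high Dice score.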