Towards Split Learning-based Privacy-Preserving Record Linkage

Michail Zervas, Alexandros Karakasidis

TL;DR

This paper addresses Privacy-Preserving Record Linkage (PPRL) by introducing a Split Learning (SL) framework in which each dataholder trains a Support Vector Machine (SVM) locally on smashed representations derived from a common Reference Set (RS). The method eliminates the need for a Linkage Unit and preserves privacy by exchanging only distance-based smashed data, while synthetic training data makes effective local model training possible. Empirical results on North Carolina voter datasets show that the approach achieves precision and recall close to those of a centralized SVM, with measurable privacy-time trade-offs: increasing the RS size has limited impact on accuracy, larger training sets improve precision, and the overhead is modest relative to the privacy gains. The work lays a foundation for privacy-aware record matching, with potential extensions via differential privacy to provide formal guarantees.
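As a rough illustration of what a distance-based smashed representation could look like, the sketch below maps a record value to its vector of distances from every entry of a shared Reference Set. It is not the authors' implementation: the choice of Levenshtein edit distance, the function names `levenshtein` and `smash`, and the toy reference values are all assumptions made for this example, and the SVM training step is left out.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (assumed metric, for illustration only)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def smash(record: str, reference_set: List[str]) -> List[int]:
    """Hypothetical helper: represent a record by its distances to the RS entries."""
    return [levenshtein(record, ref) for ref in reference_set]


# Toy public Reference Set shared by all dataholders (made-up values).
REFERENCE_SET = ["smith", "johnson", "garcia"]

# Each dataholder would compute such vectors locally; only the vectors,
# never the raw strings, would leave its premises.
vector_a = smash("smyth", REFERENCE_SET)
vector_b = smash("smith", REFERENCE_SET)
```

Under these assumptions, similar values such as "smyth" and "smith" yield nearby vectors, which is what allows a classifier trained on such vectors to separate matching from non-matching pairs without seeing the original strings.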

Abstract

Split Learning has recently been introduced to facilitate applications where user data privacy is a requirement. However, it has not been thoroughly studied in the context of Privacy-Preserving Record Linkage, a problem in which the same real-world entity must be identified across databases held by different dataholders without disclosing any additional information. In this paper, we investigate the potential of Split Learning for Privacy-Preserving Record Matching by introducing a novel training method based on Reference Sets, which are publicly available data corpora, and show that it has minimal impact on matching quality compared with a traditional centralized SVM-based technique.

Paper Structure

This paper contains 19 sections, 1 equation, 5 figures, 3 algorithms.

Figures (5)

  • Figure 1: Method Precision vs. RS size.
  • Figure 2: Method Recall vs. RS size.
  • Figure 3: Method Precision vs. Training Set size.
  • Figure 4: Method Recall vs. Training Set size.
  • Figure 6: Matching times comparison.

Theorems & Definitions (2)

  • Example 1
  • Example 2