Feasibility of Privacy-Preserving Entity Resolution on Confidential Healthcare Datasets Using Homomorphic Encryption

Yixiang Yao, Joseph Cecil, Praveen Angyan, Neil Bahroos, Srivatsan Ravi

TL;DR

This work tackles privacy-preserving entity resolution across confidential healthcare datasets under regulatory constraints. It implements a CKKS-based AMPPERE pipeline, augmented with record-based chunking, SIMD, parallel processing, and tailored parameter tuning to achieve scalable, accurate matching while keeping data encrypted. Empirical results on mortality-related datasets show near-complete blocking (PC ≈ 100%), very high reduction in candidate comparisons (RR ≈ 99.996%), and strong ER performance, with substantial runtime improvements (up to 24x faster than AMPPERE and 447x faster than naive HE). The approach demonstrates practical HIPAA/GDPR-compliant data linkage for healthcare research, enabling secure cross-institution data sharing and collective decryption of results without exposing underlying records.
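The two blocking metrics quoted above can be made concrete with a short sketch. It assumes the standard definitions from the entity-resolution literature: pairs completeness (PC) is the fraction of true matching pairs that survive blocking, and reduction ratio (RR) is the fraction of the full cross product that blocking avoids comparing. The function names and the toy numbers are illustrative assumptions, not taken from the paper's code or data.

```python
def pairs_completeness(candidate_pairs, true_matches):
    """PC: share of true matching pairs that appear among the candidates."""
    if not true_matches:
        raise ValueError("true_matches must be non-empty")
    return len(candidate_pairs & true_matches) / len(true_matches)


def reduction_ratio(num_candidates, size_a, size_b):
    """RR: share of the |A| x |B| comparisons that blocking eliminates."""
    total_pairs = size_a * size_b
    if total_pairs == 0:
        raise ValueError("both datasets must be non-empty")
    return 1.0 - num_candidates / total_pairs


# Hypothetical toy input: pairs are (record id in A, record id in B).
candidates = {(1, 1), (2, 2), (3, 5)}
truth = {(1, 1), (2, 2)}

pc = pairs_completeness(candidates, truth)
# Hypothetical sizes: 40 candidate pairs out of 1000 x 1000 possible pairs.
rr = reduction_ratio(40, 1000, 1000)
print(f"PC = {pc:.3%}, RR = {rr:.3%}")
```

A high RR only helps if PC stays high as well: blocking that discards nearly every pair is cheap, but it is useful only when the true matches remain among the surviving candidates.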

Abstract

Patient datasets contain confidential information which is protected by laws and regulations such as HIPAA and GDPR. Ensuring comprehensive patient information necessitates privacy-preserving entity resolution (PPER), which identifies identical patient entities across multiple databases from different healthcare organizations while maintaining data privacy. Existing methods often lack cryptographic security or are computationally impractical for real-world datasets. We introduce a PPER pipeline based on AMPPERE, a secure abstract computation model utilizing cryptographic tools like homomorphic encryption. Our tailored approach incorporates extensive parallelization techniques and optimal parameters specifically for patient datasets. Experimental results demonstrate the proposed method's effectiveness in terms of accuracy and efficiency compared to various baselines.

Paper Structure (17 sections, 9 figures, 3 tables, 3 algorithms)

Figures (9)

  • Figure 1: The overview of the dataset and the pipeline. Orange arrows and boxes indicate that the data is encrypted. Each record, after pre-processing and tokenization, is encrypted into a ciphertext. All records from both datasets are then sent to the optimized AMPPERE pipeline, which homomorphically evaluates the potential matches and produces record pairs with scores in encrypted form. Finally, the data owners decrypt the results collectively, while an adversary cannot gather any information.
  • Figure 2: The timing diagram of AMPPERE
  • Figure 3: Chunking strategies. Data owners compute blocks independently. In the blocking-based method, these blocks are shared to form candidate pairs, which are subsequently chunked. In the record-based approach, by contrast, chunking occurs directly on the records, and candidate pairs are generated via private matrix manipulation within each chunk.
  • Figure 4: ROC curve for the ER system.
  • Figure 5: Estimated runtime for our optimized system versus two baselines.
  • ...and 4 more figures
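
The record-based chunking idea in the Figure 3 caption can be illustrated with a plaintext sketch. It only shows how two record lists might be split into fixed-size chunks so that each pair of chunks becomes an independent unit of work. It involves no encryption and does not reproduce the private matrix manipulation the paper performs inside each chunk. The function names and the chunk size are hypothetical.

```python
from itertools import product


def chunk(records, size):
    """Split a list of records into consecutive chunks of at most `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [records[i:i + size] for i in range(0, len(records), size)]


def chunk_pair_tasks(records_a, records_b, size):
    """Pair every chunk of A with every chunk of B; each pair is one task."""
    return list(product(chunk(records_a, size), chunk(records_b, size)))


def pairs_in_task(task):
    """Enumerate the record pairs covered by a single chunk-pair task."""
    chunk_a, chunk_b = task
    return [(a, b) for a in chunk_a for b in chunk_b]


# Hypothetical record ids standing in for (encrypted) records.
records_a = list(range(5))
records_b = list(range(10, 17))

tasks = chunk_pair_tasks(records_a, records_b, size=2)
covered = [pair for task in tasks for pair in pairs_in_task(task)]
print(len(tasks), "tasks cover", len(covered), "record pairs")
```

Because the tasks share no state, they could be handed to separate workers, which is one plausible reading of how chunking combines with the parallel processing mentioned in the TL;DR.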