Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption

Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Richard Nock, Giorgio Patrini, Guillaume Smith, Brian Thorne

TL;DR

This work presents a privacy-preserving framework for learning a linear model on vertically partitioned data using entity resolution and additively homomorphic encryption. It provides an end-to-end three-party protocol (two data providers, A and B, and a coordinator, C) that combines privacy-preserving entity matching with secure federated logistic regression, so that learning proceeds without exposing raw data or revealing which entities the providers have in common. The authors offer what is, to their knowledge, the first formal analysis of how entity-resolution errors affect learning, proving robustness in large-margin regimes and establishing convergence and generalization bounds under structured permutation errors. Empirical results show that learning with a Taylor approximation of the logistic loss converges comparably to learning with the logistic loss itself, scales to millions of entities with hundreds of features, and reaches accuracy close to that obtained on perfectly linked data. These results support federated learning when data integration yields significant predictive gains.
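To make the "Taylor loss" concrete, here is a minimal sketch that compares the logistic loss with its second-order Taylor expansion around zero. The function names are illustrative, and the expansion point (zero) and order (two) are assumptions made for this example rather than details stated in this summary. A low-degree polynomial of this kind is convenient when ciphertexts only support addition and multiplication by plaintext values.

```python
import math


def logistic_loss(z: float) -> float:
    """Logistic loss log(1 + exp(-z)), where z = y * <theta, x> and y is in {-1, +1}."""
    return math.log1p(math.exp(-z))


def taylor_loss(z: float) -> float:
    """Second-order Taylor expansion of the logistic loss around z = 0.

    With f(z) = log(1 + exp(-z)): f(0) = log 2, f'(0) = -1/2, f''(0) = 1/4,
    so f(z) is approximately log 2 - z/2 + z^2/8 for small |z|.
    """
    return math.log(2.0) - 0.5 * z + 0.125 * z * z


if __name__ == "__main__":
    # The two losses agree closely near zero and drift apart as |z| grows.
    for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"z={z:+.1f}  logistic={logistic_loss(z):.5f}  taylor={taylor_loss(z):.5f}")
```

The approximation is tight only for small margins, so in practice the features or the model would need to keep z in a moderate range.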

Abstract

Consider two data providers, each maintaining private records of different feature sets about common entities. They aim to learn a linear model jointly in a federated setting, that is, data stays local and a shared model is trained from locally computed updates. In contrast with most work on distributed learning, in this scenario (i) data is split vertically, i.e. by features, (ii) only one data provider knows the target variable and (iii) entities are not linked across the data providers. Hence, to the challenge of private learning, we add the potentially negative consequences of mistakes in entity resolution. Our contribution is twofold. First, we describe a three-party end-to-end solution in two phases (privacy-preserving entity resolution, then federated logistic regression over messages encrypted with an additively homomorphic scheme) that is secure against an honest-but-curious adversary. The system allows learning without either exposing data in the clear or sharing which entities the data providers have in common. Our implementation is as accurate as a naive non-private solution that brings all data into one place, and scales to problems with millions of entities and hundreds of features. Second, we provide what is to our knowledge the first formal analysis of the impact of entity resolution mistakes on learning, with results on how optimal classifiers, empirical losses, margins and generalisation abilities are affected. Our results provide clear and strong support for federated learning: under reasonable assumptions on the number and magnitude of entity resolution mistakes, it can be extremely beneficial to carry out federated learning in the setting where each peer's data provides a significant uplift to the other.
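As an illustration of what "additively homomorphic" means, the following toy sketch implements textbook Paillier encryption with tiny hard-coded primes. Paillier is used here only as a well-known example of such a scheme; the abstract does not name the scheme, and the key size, function names and parameters below are assumptions chosen for readability. The code is insecure and handles non-negative integers only.

```python
import math
import random


def keygen(p: int = 1009, q: int = 1013):
    """Toy Paillier key generation with small primes (illustration only, not secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Encrypt an integer 0 <= m < n under public key n, with generator g = n + 1."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, priv, c: int) -> int:
    """Decrypt ciphertext c with the private key (lam, mu)."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)


def mul_by_plain(n: int, c: int, k: int) -> int:
    """Raising a ciphertext to a plaintext power k multiplies the plaintext by k (mod n)."""
    return pow(c, k, n * n)


if __name__ == "__main__":
    n, priv = keygen()
    total = add_encrypted(n, encrypt(n, 17), encrypt(n, 25))
    scaled = mul_by_plain(n, encrypt(n, 7), 6)
    print(decrypt(n, priv, total), decrypt(n, priv, scaled))
```

A party holding only the public key can thus sum encrypted values, or scale them by values it knows, without learning the plaintexts. A real deployment would use a vetted library, large keys and an encoding for signed or fractional numbers.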

Paper Structure

This paper contains 37 sections, 18 theorems, 199 equations, 9 figures, 4 tables, 5 algorithms.

Key Result

Theorem 6

Suppose $\mathsf{P}_*$ is $(\varepsilon, \tau)$-accurate and the data-model calibration assumption holds. Then the following holds: If, furthermore, $\mathsf{P}_*$ is $\alpha$-bounded, then we get with $C(n) \stackrel{\mathrm{.}}{=} (\xi / n)^{\alpha}$.

Figures (9)

  • Figure 1: Relationships between the Coordinator, $\mathsf{C}$, and the Data Providers, $\mathsf{A}$ and $\mathsf{B}$.
  • Figure 2: The problem of entity resolution.
  • Figure 3: Overview of the notation for, and relationships between, the different variables in logistic regression.
  • Figure 4: Loss profiles.
  • Figure 5: (a) Learning curve for Taylor vs. logistic loss (straight lines) and their test error (dotted); (b) runtime of entity matching with respect to the size of the two datasets; runtime of one learning epoch (all mini-batches + hold-out loss evaluation) with respect to the number of (c) examples and (d) features.
  • ...and 4 more figures

Theorems & Definitions (26)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Theorem 6
  • Definition 7
  • Theorem 8
  • Theorem 9
  • Theorem 10
  • ...and 16 more