SecureBoost: A Lossless Federated Learning Framework

Kewei Cheng, Tao Fan, Yilun Jin, Yang Liu, Tianjian Chen, Dimitrios Papadopoulos, Qiang Yang

TL;DR

The paper tackles privacy-preserving machine learning when data are vertically partitioned across organizations by introducing SecureBoost, a lossless federated gradient-boosting framework. It leverages privacy-preserving entity alignment and Paillier-based encrypted gradient aggregation to build a shared tree ensemble without exposing private data, achieving accuracy comparable to centralized, non-private models. The authors prove losslessness, analyze information leakage, and propose Reduced-Leakage SecureBoost (RL-SecureBoost) to mitigate leakage while preserving performance. Empirical results on credit datasets demonstrate scalable training and parity with non-federated baselines, highlighting practical viability for industrial applications such as credit risk analysis. The work also discusses secure inference and future enhancements via secure multi-party computation techniques to further bolster privacy guarantees.

Abstract

The protection of user privacy is an important concern in machine learning, as evidenced by the rollout of the General Data Protection Regulation (GDPR) in the European Union (EU) in May 2018. The GDPR is designed to give users more control over their personal data, which motivates us to explore machine learning frameworks for data sharing that do not violate user privacy. To meet this goal, we propose a novel lossless privacy-preserving tree-boosting system, SecureBoost, in the setting of federated learning. SecureBoost first conducts entity alignment under a privacy-preserving protocol and then constructs boosting trees across multiple parties with a carefully designed encryption strategy. This federated learning system allows the learning process to be conducted jointly over multiple parties that share common user samples but hold different feature sets, which corresponds to a vertically partitioned data set. An advantage of SecureBoost is that it provides the same level of accuracy as the non-privacy-preserving approach while revealing no information about any private data provider. We show that the SecureBoost framework is as accurate as other non-federated gradient tree-boosting algorithms that require centralized data, and thus it is highly scalable and practical for industrial applications such as credit risk analysis. We also discuss information leakage during protocol execution and propose ways to provably reduce it.
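
The encrypted gradient aggregation mentioned in the summary relies on the additive homomorphism of Paillier encryption: multiplying ciphertexts yields an encryption of the sum of the plaintexts. The following is a minimal sketch of that idea only, not the authors' implementation. The function names, the fixed-point factor `SCALE`, the toy primes, and the example data are all assumptions made for illustration, and the key size is far too small to be secure.

```python
import math
import random

SCALE = 1000  # assumed fixed-point factor: floats are encoded as integers


def keygen(p=1789, q=1931):
    """Toy Paillier key pair from two small primes (NOT secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid for generator g = n + 1; needs Python 3.8+
    return n, (lam, mu)


def encrypt(n, value):
    """Encrypt a float after fixed-point encoding; negatives wrap mod n."""
    m = round(value * SCALE) % n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    n2 = n * n
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    """Decrypt and map the result back to a signed float."""
    lam, mu = priv
    m = ((pow(c, lam, n * n) - 1) // n * mu) % n
    if m > n // 2:  # upper half of the range represents negative values
        m -= n
    return m / SCALE


def add_ciphertexts(n, ciphertexts):
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    total = 1  # 1 is a valid encryption of 0
    for c in ciphertexts:
        total = (total * c) % (n * n)
    return total


# Party holding the labels: computes gradients and shares only ciphertexts.
n, priv = keygen()
gradients = [0.25, -0.5, 0.125, 0.75]
enc_gradients = [encrypt(n, g) for g in gradients]

# Party holding a feature: sums the encrypted gradients of the samples that
# fall left of a candidate threshold, without seeing any gradient value.
feature = [3, 8, 1, 9]
threshold = 5
left = [c for c, x in zip(enc_gradients, feature) if x <= threshold]
enc_left_sum = add_ciphertexts(n, left)

# Label holder decrypts the aggregate only.
print(decrypt(n, priv, enc_left_sum))  # 0.375
```

In this sketch the feature holder never sees a plaintext gradient and the label holder never sees a feature value, only the decrypted per-split sum. A real deployment would use a vetted Paillier library and keys of at least 1024 bits.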

Paper Structure

This paper contains 11 sections, 3 theorems, 12 equations, 5 figures, 2 tables, 2 algorithms.

Key Result

Theorem 1

SecureBoost is lossless, i.e., the SecureBoost model $M$ and the XGBoost model $M'$ behave identically, provided that $M$ and $M'$ are identically initialized and hyper-parameterized.
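
One way to see why this is plausible, sketched under the standard XGBoost formulation: the symbols $g_i$, $h_i$, $I_L$, $I_R$, $\lambda$, $\gamma$, $\mathrm{Enc}$ and $\mathrm{Dec}$ are notation assumed here and may differ from the paper's. XGBoost scores a candidate split using only sums of first- and second-order gradients over the samples on each side:

$$
\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma,
\qquad G_L = \sum_{i \in I_L} g_i,\quad H_L = \sum_{i \in I_L} h_i,
$$

with $G_R$ and $H_R$ defined in the same way over $I_R$. Under Paillier encryption with modulus $n$, the product of ciphertexts decrypts to the sum of the plaintexts:

$$
\mathrm{Dec}\Big(\prod_{i \in I_L} \mathrm{Enc}(g_i) \bmod n^2\Big) = \sum_{i \in I_L} g_i \bmod n .
$$

The party that decrypts therefore recovers the same $G_L$, $H_L$, $G_R$, $H_R$ that a centralized run would compute, up to the precision of whatever fixed-point encoding is used, and so evaluates the same gain for every candidate split.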

Figures (5)

  • Figure 1: Illustration of the proposed SecureBoost framework
  • Figure 2: Vertically partitioned data set
  • Figure 3: An illustration of Federated Inference
  • Figure 4: Loss convergence
  • Figure 5: Scalability Analysis of SecureBoost

Theorems & Definitions (8)

  • Definition 1
  • Definition 2
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof