Hyperparameter Optimization for SecureBoost via Constrained Multi-Objective Federated Learning

Yan Kang, Ziyao Ren, Lixin Fan, Linghua Yang, Yongxin Tong, Qiang Yang

TL;DR

The paper tackles privacy leakage and suboptimal hyperparameter choices in SecureBoost within vertical federated learning (VFL). It introduces the Instance Clustering Attack (ICA) to quantify label leakage and two defenses (Local Trees and Purity Threshold) to mitigate it. Building on NSGA-II, the Constrained Multi-Objective SecureBoost (CMOSB) algorithm optimizes three objectives (utility loss $\epsilon_u$, training cost $\epsilon_c$, and privacy leakage $\epsilon_p$) under constraints to approximate Pareto-optimal hyperparameters. Experiments on four datasets show that CMOSB outperforms grid search and Bayesian optimization in the trade-off between privacy, utility, and efficiency.

Abstract

SecureBoost is a tree-boosting algorithm that leverages homomorphic encryption (HE) to protect data privacy in vertical federated learning. SecureBoost and its variants have been widely adopted in fields such as finance and healthcare. However, the hyperparameters of SecureBoost are typically configured heuristically to optimize model performance (i.e., utility) alone, on the assumption that privacy is secured. Our study found that SecureBoost and some of its variants are still vulnerable to label leakage. As a result, the current heuristic hyperparameter configuration of SecureBoost may yield a suboptimal trade-off between utility, privacy, and efficiency, which are pivotal elements of a trustworthy federated learning system. To address this issue, we propose the Constrained Multi-Objective SecureBoost (CMOSB) algorithm, which aims to approximate Pareto optimal solutions, each of which is a set of hyperparameters achieving an optimal trade-off between utility loss, training cost, and privacy leakage. We design measurements of the three objectives, including a novel label inference attack named instance clustering attack (ICA) to measure the privacy leakage of SecureBoost. Additionally, we provide two countermeasures against ICA. The experimental results demonstrate that CMOSB yields superior hyperparameters over those optimized by grid search and Bayesian optimization regarding the trade-off between utility loss, training cost, and privacy leakage.
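
For orientation, a constrained three-objective problem of the kind described here is often written as below. This is only a generic sketch, not the paper's own equation: the hyperparameter vector $x$, the search space $\mathcal{X}$, and the upper bounds $\phi_u$, $\phi_c$, $\phi_p$ are assumed notation introduced for illustration, and the paper's actual constraints may take a different form.

```latex
\begin{aligned}
\min_{x \in \mathcal{X}} \quad & \bigl(\epsilon_u(x),\ \epsilon_c(x),\ \epsilon_p(x)\bigr) \\
\text{subject to} \quad & \epsilon_u(x) \le \phi_u, \quad
                          \epsilon_c(x) \le \phi_c, \quad
                          \epsilon_p(x) \le \phi_p .
\end{aligned}
```

Under this reading, a hyperparameter setting is feasible only if all three objectives stay within their bounds, and the search approximates the set of feasible settings that no other feasible setting improves on in every objective.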

Paper Structure (22 sections, 6 equations, 11 figures, 4 tables, 4 algorithms)

Figures (11)

  • Figure 1: An illustration of data partition in VFL.
  • Figure 2: The workflow of instance clustering attack. (1) The attacker constructs a similarity matrix based on the instance distribution. (2) The attacker clusters the training instances based on the similarity matrix. (3) The attacker infers labels of unlabeled instances based on the known labels.
  • Figure 3: Mutual information between instance distribution and labels. BC: binary classification; MC: multi-class classification. Higher mutual information implies a higher likelihood of privacy leakage. The mutual information sharply decreases in the first few trees and then starts fluctuating.
  • Figure 4: Illustration of the purity threshold method. (1) The active party sends instance distribution and calculates the optimal split point based on statistical data. (2) The active party performs a purity check of the optimal split point. (3) If the purity exceeds the threshold $\theta_p$, it builds the subtree locally; otherwise, the training continues normally.
  • Figure 5: Effectiveness of Defense Methods. The yellow line represents privacy leakage, where lower values indicate a more secure model. The blue line represents utility loss, where lower values indicate better model performance. The first and second columns are experiments on Local Trees and Purity Threshold, respectively. BC: binary classification; MC: multi-class classification. Reducing $p$ or increasing $n_l$ can decrease privacy leakage while sacrificing utility.
  • ...and 6 more figures
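
The three steps in the Figure 2 caption (similarity matrix, clustering, label inference) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the instance distribution is assumed to be available as per-tree node assignments, similarity is taken as the fraction of trees in which two instances share a node, and a simple threshold-based connected-components grouping stands in for whatever clustering method the paper uses. All function names and the `threshold` parameter are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def similarity_matrix(node_ids: List[List[int]]) -> List[List[float]]:
    """node_ids[t][i] is the node that instance i falls into in tree t (assumed input).

    Similarity of two instances = fraction of trees where they share a node.
    """
    n_trees, n = len(node_ids), len(node_ids[0])
    return [
        [sum(tree[i] == tree[j] for tree in node_ids) / n_trees for j in range(n)]
        for i in range(n)
    ]


def cluster(sim: List[List[float]], threshold: float) -> List[int]:
    """Stand-in clustering: link instances whose similarity >= threshold (union-find)."""
    n = len(sim)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def infer_labels(clusters: List[int], known: Dict[int, int]) -> List[Optional[int]]:
    """Assign each instance the majority known label of its cluster (None if no label is known)."""
    votes: Dict[int, Counter] = {}
    for idx, label in known.items():
        votes.setdefault(clusters[idx], Counter())[label] += 1
    return [votes[c].most_common(1)[0][0] if c in votes else None for c in clusters]


# Toy data: 3 trees, 4 instances; the attacker knows the labels of instances 0 and 3.
node_ids = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
sim = similarity_matrix(node_ids)
clusters = cluster(sim, threshold=0.6)
inferred = infer_labels(clusters, known={0: 0, 3: 1})
```

The point of the sketch is that a handful of known labels can propagate to every instance in the same cluster, which is why the instance distribution itself carries label information.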

Theorems & Definitions (5)

  • Remark 1
  • Definition 1: Pareto Dominance
  • Definition 2: Pareto Optimal Solution
  • Definition 3: Pareto Set and Front
  • Definition 4: Hypervolume Indicator
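
The concepts named in these definitions follow standard multi-objective optimization usage, which can be sketched as below for minimization. The sketch uses the textbook definitions rather than the paper's exact statements, and the hypervolume routine is restricted to two objectives for brevity (the paper optimizes three). The function names, the sample points, and the reference point are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if a is no worse in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep the points that no other point dominates."""
    return [
        p for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


def hypervolume_2d(front: List[Tuple[float, float]], ref: Tuple[float, float]) -> float:
    """Area dominated by a 2-objective front and bounded by the reference point.

    Assumes every point in `front` is non-dominated and no worse than `ref`.
    """
    volume, prev_y = 0.0, ref[1]
    for x, y in sorted(front):  # ascending in the first objective
        volume += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return volume


# Hypothetical (utility loss, training cost, privacy leakage) triples.
candidates = [(0.10, 5.0, 0.30), (0.20, 4.0, 0.20), (0.20, 6.0, 0.40)]
front = pareto_front(candidates)

# Two-objective example for the hypervolume indicator.
hv = hypervolume_2d([(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)], ref=(4.0, 4.0))
```

A larger hypervolume means the front covers more of the objective space below the reference point, which is why it is commonly used to compare fronts produced by different optimizers.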