HashVFL: Defending Against Data Reconstruction Attacks in Vertical Federated Learning

Pengyu Qiu, Xuhong Zhang, Shouling Ji, Chong Fu, Xing Yang, Ting Wang

TL;DR

HashVFL defends vertical federated learning (VFL) against data reconstruction attacks with a hashing-based pipeline that binarizes the intermediate representations each party uploads. Three components make this workable: a Batch Normalization layer that keeps the hash bits balanced, a Sign-based hash layer trained with the Straight-Through Estimator (STE) to preserve learnability, and predefined binary codes that enforce cross-party consistency and allow linear-time comparisons. Empirically, HashVFL preserves main-task accuracy across diverse data modalities while substantially mitigating reconstruction, reducing label leakage, improving adversarial robustness, and enabling abnormal-input detection. The results suggest that hashing can provide strong empirical privacy protection in VFL without sacrificing performance. The paper also discusses limitations and possible extensions, such as adaptive attacks and party-bias mitigation.
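The interplay of the three ingredients named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `batch_norm`, `sign_hash`, and `ste_backward` are hypothetical, the normalization has no learnable scale or shift, and mapping zero to +1 is an assumption made here for definiteness.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each feature column to zero mean and unit variance."""
    n, dims = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]


def sign_hash(batch):
    """Forward pass of the hash layer: each value becomes +1 or -1.

    Zero is mapped to +1 (an assumption of this sketch).
    """
    return [[1 if v >= 0 else -1 for v in row] for row in batch]


def ste_backward(grad_wrt_codes):
    """Backward pass under the Straight-Through Estimator.

    Sign has zero gradient almost everywhere, so STE treats it as the
    identity and hands the incoming gradient back unchanged.
    """
    return [list(row) for row in grad_wrt_codes]


# Toy batch: 4 samples, 2 features, all values positive.
batch = [[1.0, 10.0], [2.0, 30.0], [4.0, 20.0], [5.0, 40.0]]

raw_codes = sign_hash(batch)                   # every bit is +1
balanced_codes = sign_hash(batch_norm(batch))  # each bit splits the batch evenly
```

In this toy batch, hashing the raw values sets every bit to +1, so the codes carry no information. Centering each feature first makes each bit split the batch in half, which is the intuition behind using normalization for bit balance.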

Abstract

Vertical Federated Learning (VFL) is a trending collaborative machine learning model training solution. Existing industrial frameworks employ secure multi-party computation techniques such as homomorphic encryption to ensure data security and privacy. Despite these efforts, studies have revealed that data leakage remains a risk in VFL due to the correlations between intermediate representations and raw data. Neural networks can accurately capture these correlations, allowing an adversary to reconstruct the data. This emphasizes the need for continued research into securing VFL systems. Our work shows that hashing is a promising solution to counter data reconstruction attacks. The one-way nature of hashing makes it difficult for an adversary to recover data from hash codes. However, implementing hashing in VFL presents new challenges, including vanishing gradients and information loss. To address these issues, we propose HashVFL, which integrates hashing and simultaneously achieves learnability, bit balance, and consistency. Experimental results indicate that HashVFL effectively maintains task performance while defending against data reconstruction attacks. It also brings additional benefits in reducing the degree of label leakage, mitigating adversarial attacks, and detecting abnormal inputs. We hope our work will inspire further research into the potential applications of HashVFL.

Paper Structure

This paper contains 45 sections, 13 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: A typical scenario for using VFL involves Party A, an e-commerce company with features 1 and 2, and Party B, a bank with features 3, 4, 5 and the label. Together, they train a model that predicts loan approval decisions.
  • Figure 2: Overview of HashVFL. 1) Each party uses its bottom model to extract abstractions from local data. 2) A BN layer normalizes the extracted abstractions. 3) The hash layer binarizes the normalized abstractions from step 2, and the resulting codes are uploaded to the server. 4) The server uses the top model to calculate the classification loss and the distance between these codes and their target pre-defined binary codes. 5) The server calculates the gradients and transmits them back to the corresponding parties. Because of the Straight-Through Estimator (STE), the gradients pass through the hash layer unchanged, so parameter updates begin at the batch normalization layer.
  • Figure 3: Revealing common features on MNIST. The first row shows reconstructions from 4-bit hash codes for different classes. The second and third rows show reconstructions from 8-bit and 16-bit codes, respectively.
  • Figure 4: Impact of number of parties on accuracy. X-axis is the number of parties involved, while the Y-axis is the accuracy achieved on the test set.
  • Figure 5: Impact of feature ratio on accuracy. X-axis is the ratio of features held by one party, while the Y-axis is the accuracy achieved on the test set.
  • ...and 1 more figure
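Step 4 of the Figure 2 overview, where the server measures the distance between uploaded codes and their target pre-defined binary codes, can be sketched as follows. This is a hedged illustration only. The table `TARGET_CODES`, the 4-bit code length, and the helpers `hamming_distance`, `hamming_via_dot`, `consistency_distance`, and `nearest_class` are assumptions introduced here. Hamming distance is also an assumed choice of metric, and the paper's exact distance term may differ.

```python
def hamming_distance(a, b):
    """Count the positions where two codes differ; linear in code length."""
    return sum(x != y for x, y in zip(a, b))


def hamming_via_dot(a, b):
    """Same distance for +1/-1 codes, computed from the inner product.

    Matching bits contribute +1 and differing bits -1 to the dot product,
    so distance = (length - dot) / 2.
    """
    dot = sum(x * y for x, y in zip(a, b))
    return (len(a) - dot) // 2


# Hypothetical pre-defined target codes, one per class label.
TARGET_CODES = {
    0: [1, 1, -1, -1],
    1: [-1, 1, -1, 1],
}


def consistency_distance(party_codes, label):
    """Mean distance between every party's code and the label's target code."""
    target = TARGET_CODES[label]
    return sum(hamming_distance(c, target) for c in party_codes) / len(party_codes)


def nearest_class(code):
    """Return the label whose target code is closest to the given code."""
    return min(TARGET_CODES, key=lambda k: hamming_distance(code, TARGET_CODES[k]))


# Two parties upload codes for a sample whose true label is 0.
party_codes = [[1, 1, -1, -1], [1, -1, -1, -1]]
mean_dist = consistency_distance(party_codes, label=0)
```

Here the first party matches the target exactly and the second differs in one bit, so the mean distance is 0.5. Pulling every party's code toward the same target is one way to picture the cross-party consistency objective. A code that lies far from every target could likewise be flagged as an abnormal input, though the threshold for doing so is not specified in this excerpt.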