VertiBench: Advancing Feature Distribution Diversity in Vertical Federated Learning Benchmarks

Zhaomin Wu, Junyi Hou, Bingsheng He

TL;DR

The paper addresses the lack of public real-world VFL benchmarks by introducing VertiBench, which characterizes performance through party importance and inter-party correlation. It proposes synthetic-VFL generation methods controlled by Dirichlet-based importance distributions and correlation-based splits, plus a real Satellite-VFL dataset, to cover broad and realistic partitions. Key contributions include synthetic dataset generation methods, a real-world Satellite dataset, evaluation metrics (Shapley, Shapley-CMI, Pcor, Icor), and comprehensive benchmarking of leading VFL algorithms. Experiments show that synthetic partitions with matched ($\alpha$, $\beta$) approximate real VFL performance, enabling robust benchmarking and guiding future algorithm design.

Abstract

Vertical Federated Learning (VFL) is a crucial paradigm for training machine learning models on feature-partitioned, distributed data. However, due to privacy restrictions, few public real-world VFL datasets exist for algorithm evaluation, and these represent a limited array of feature distributions. Existing benchmarks often resort to synthetic datasets, derived from arbitrary feature splits from a global set, which only capture a subset of feature distributions, leading to inadequate algorithm performance assessment. This paper addresses these shortcomings by introducing two key factors affecting VFL performance - feature importance and feature correlation - and proposing associated evaluation metrics and dataset splitting methods. Additionally, we introduce a real VFL dataset to address the deficit in image-image VFL scenarios. Our comprehensive evaluation of cutting-edge VFL algorithms provides valuable insights for future research in the field.
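
The importance-based split described in the TL;DR can be illustrated with a minimal sketch. It assumes that per-party ratios are drawn from a symmetric Dirichlet distribution with concentration $\alpha$ and that each feature is then assigned independently to a party according to those ratios. The function name, parameters, and seed handling are hypothetical and are not taken from the VertiBench code base.

```python
import numpy as np

def split_features_by_importance(n_features, n_parties, alpha, seed=0):
    """Assign feature indices to parties (illustrative sketch only).

    Assumption: party ratios r ~ Dirichlet(alpha, ..., alpha); each feature
    goes to party k with probability r_k. A small alpha tends to produce
    imbalanced parties, a large alpha tends to produce balanced ones.
    """
    rng = np.random.default_rng(seed)
    # Draw one ratio per party; the ratios are non-negative and sum to 1.
    ratios = rng.dirichlet(np.full(n_parties, alpha))
    # Independently pick an owning party for every feature.
    owners = rng.choice(n_parties, size=n_features, p=ratios)
    # Collect the feature indices held by each party (may be empty).
    parts = [np.flatnonzero(owners == k) for k in range(n_parties)]
    return parts, ratios

parts, ratios = split_features_by_importance(n_features=20, n_parties=4, alpha=0.5)
for k, idx in enumerate(parts):
    print(f"party {k}: ratio={ratios[k]:.2f}, features={idx.tolist()}")
```

Under this sketch a party can end up with no features when its ratio is very small, so a practical splitter would likely need to handle that case explicitly.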

Paper Structure (59 sections, 12 theorems, 40 equations, 26 figures, 9 tables, 5 algorithms)


Key Result

Proposition 1

The probability mass function can be written as

Figures (26)

  • Figure 1: Overview of existing VFL pipelines and datasets and the estimated scope of VFL datasets
  • Figure 2: Examples of Pcor values on different levels of correlation. $U$ means uniform distribution. Arrow direction indicates right singular vector orientation, arrow scale represents singular values.
  • Figure 3: Accuracy of VFL algorithms on different datasets varying imbalance and correlation
  • Figure 4: Mean accuracy differences: synthetic datasets vs. real datasets
  • Figure 5: The trend of correlation and metrics when exchanging features between two parties. mcor: multi-way correlation (Taylor, 2020); Pcor(i)-(j): Pcor($\mathbf{X}_i,\mathbf{X}_j$); mcor(i)-(j): mcor($\mathbf{X}_i,\mathbf{X}_j$).
  • ...and 21 more figures
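
The caption of Figure 2 ties Pcor to the singular values of a correlation matrix. The sketch below is one possible reading of that idea, not the paper's definition: it assumes Pcor is the sample standard deviation of the singular values of the Pearson cross-correlation matrix between two parties' features, scaled by $1/\sqrt{d}$, where $d$ is the number of singular values. The function name `pcor_sketch` is hypothetical.

```python
import numpy as np

def pcor_sketch(Xi, Xj):
    """Correlation score between two feature blocks (illustrative only).

    Assumption: score = std(singular values of corr(Xi, Xj), ddof=1) / sqrt(d).
    Xi and Xj are (n_samples, n_features) arrays without constant columns.
    """
    mi = Xi.shape[1]
    # Full Pearson correlation matrix over the columns of [Xi | Xj].
    full = np.corrcoef(Xi, Xj, rowvar=False)
    # Keep only the block relating Xi's columns to Xj's columns.
    cross = full[:mi, mi:]
    # Singular values of the cross-correlation block.
    sv = np.linalg.svd(cross, compute_uv=False)
    d = sv.size
    if d < 2:
        return 0.0  # the sample standard deviation is undefined for one value
    return float(np.std(sv, ddof=1) / np.sqrt(d))

rng = np.random.default_rng(0)
independent = pcor_sketch(rng.uniform(size=(5000, 3)), rng.uniform(size=(5000, 3)))
copies = np.tile(rng.uniform(size=(5000, 1)), (1, 3))
identical = pcor_sketch(copies, copies)
print(f"independent parties: {independent:.3f}, perfectly correlated: {identical:.3f}")
```

Under this assumed form, independent features give a score near 0 and perfectly correlated features give a score of 1, which matches the qualitative behaviour the caption describes.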

Theorems & Definitions (18)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Theorem 1
  • Theorem 2
  • Proposition 4
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • ...and 8 more