VertiBench: Advancing Feature Distribution Diversity in Vertical Federated Learning Benchmarks

Zhaomin Wu; Junyi Hou; Bingsheng He

TL;DR

The paper addresses the scarcity of public real-world VFL datasets by introducing VertiBench, a benchmark that characterizes VFL data through two factors: party importance and inter-party correlation. It contributes synthetic VFL dataset generation methods (a Dirichlet-based split that controls party importance and a correlation-based split), a real-world Satellite dataset, evaluation metrics (Shapley, Shapley-CMI, Pcor, Icor), and a comprehensive benchmark of leading VFL algorithms. Experiments show that synthetic partitions with matched importance and correlation parameters ($\alpha$, $\beta$) approximate performance on real VFL datasets, which supports their use for benchmarking and for guiding future algorithm design.

Abstract

Vertical Federated Learning (VFL) is a crucial paradigm for training machine learning models on feature-partitioned, distributed data. However, due to privacy restrictions, few public real-world VFL datasets exist for algorithm evaluation, and these represent a limited array of feature distributions. Existing benchmarks often resort to synthetic datasets, derived from arbitrary feature splits from a global set, which only capture a subset of feature distributions, leading to inadequate algorithm performance assessment. This paper addresses these shortcomings by introducing two key factors affecting VFL performance - feature importance and feature correlation - and proposing associated evaluation metrics and dataset splitting methods. Additionally, we introduce a real VFL dataset to address the deficit in image-image VFL scenarios. Our comprehensive evaluation of cutting-edge VFL algorithms provides valuable insights for future research in the field.
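As an illustration of what an importance-controlled feature split can look like, the following is a minimal sketch, not the authors' implementation. It assumes that party ratios are drawn from a symmetric Dirichlet distribution with concentration `alpha` and that each feature is then assigned to a party independently with those probabilities. The function name `split_features_by_importance` and its parameters are hypothetical.

```python
import numpy as np


def split_features_by_importance(n_features, n_parties, alpha, seed=None):
    """Assign feature indices to parties using Dirichlet-sampled ratios.

    Hypothetical helper for illustration only; not taken from VertiBench.
    """
    rng = np.random.default_rng(seed)
    # Assumption: ratios ~ Dir(alpha, ..., alpha). A small alpha tends to give
    # imbalanced ratios, a large alpha gives ratios close to uniform.
    ratios = rng.dirichlet(np.full(n_parties, alpha))
    # Each feature goes to party k with probability ratios[k].
    owners = rng.choice(n_parties, size=n_features, p=ratios)
    # Return one array of feature indices per party.
    return [np.flatnonzero(owners == k) for k in range(n_parties)]


parts = split_features_by_importance(n_features=20, n_parties=4, alpha=1.0, seed=0)
for k, idx in enumerate(parts):
    print(f"party {k}: {len(idx)} features")
```

In this sketch a party may end up with no features when its sampled ratio is small, so a practical split would need to handle that case explicitly.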

Paper Structure (59 sections, 12 theorems, 40 equations, 26 figures, 9 tables, 5 algorithms)

Introduction
Evaluate VFL Datasets
Factors that affect VFL performance
Evaluate Party Importance
Evaluate Party Correlation
Split Synthetic VFL Datasets
Split by Party Importance
Split by Party Correlation
Compare Feature Split Across Global Datasets
Experiment
Review of VFL Algorithms
Experimental Settings
VFL Accuracy
Performance Correlation: VertiBench Scope vs. Real Scope
Conclusion
...and 44 more sections

Key Result

Proposition 1

The probability mass function can be written as

Figures (26)

Figure 1: Overview of existing VFL pipelines and datasets, and the estimated scope of VFL datasets
Figure 2: Examples of Pcor values on different levels of correlation. $U$ means uniform distribution. Arrow direction indicates right singular vector orientation, arrow scale represents singular values.
Figure 3: Accuracy of VFL algorithms on different datasets with varying imbalance and correlation
Figure 4: Mean accuracy differences: synthetic datasets vs. real datasets
Figure 5: The trend of correlation and metrics when exchanging features between two parties. mcor: multi-way correlation (Taylor, 2020); Pcor(i)-(j): Pcor($\mathbf{X}_i,\mathbf{X}_j$); mcor(i)-(j): mcor($\mathbf{X}_i,\mathbf{X}_j$).
...and 21 more figures

Theorems & Definitions (18)

Proposition 1
Proposition 2
Proposition 3
Theorem 1
Theorem 2
Proposition 4
Proposition 1
proof
Proposition 2
proof
...and 8 more
