TreeCSS: An Efficient Framework for Vertical Federated Learning
Qinbo Zhang, Xiao Yan, Yukai Ding, Quanqing Xu, Chuang Hu, Xiaokai Zhou, Jiawei Jiang
TL;DR
TreeCSS tackles the core scalability bottlenecks of vertical federated learning by combining Tree-MPSI for scalable data alignment with a clustering-based coreset strategy for training. The Tree-MPSI component reduces interaction rounds via a tree structure and volume-aware scheduling, while Cluster-Coreset uses K-Means clustering across participants, encrypted cluster tuples, and sample weighting to produce a compact, informative training set. Across six datasets and multiple models, TreeCSS achieves end-to-end speedups of up to $2.93\times$ with accuracy comparable to vanilla VFL, while preserving privacy through homomorphic encryption. The approach generalizes across tasks and models, reducing the amount of training data and communication and enabling scalable, privacy-preserving VFL in practical deployments.
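To make the tree-structured alignment idea concrete, the following is a minimal sketch, not the Tree-MPSI protocol itself. It assumes that a plain Python set intersection can stand in for a two-party private set intersection, and that "volume-aware" pairing means sorting participants by set size and pairing the smallest with the largest; the pairing rule, the function name `tree_intersect`, and the example data are all hypothetical.

```python
from typing import Dict, Set, Tuple


def tree_intersect(id_sets: Dict[str, Set[int]]) -> Tuple[Set[int], int]:
    """Intersect all participants' ID sets in tree-structured rounds.

    Plain set intersection stands in for a two-party PSI protocol
    (an assumption of this sketch; no cryptography is performed).
    Returns the common IDs and the number of rounds used.
    """
    active = dict(id_sets)
    rounds = 0
    while len(active) > 1:
        # Assumed volume-aware rule: sort by current set size and
        # pair the smallest remaining set with the largest one.
        order = sorted(active, key=lambda name: len(active[name]))
        next_active: Dict[str, Set[int]] = {}
        lo, hi = 0, len(order) - 1
        while lo < hi:
            small, large = order[lo], order[hi]
            # Pairs within a round are independent, so they could run in parallel.
            next_active[small] = active[small] & active[large]
            lo += 1
            hi -= 1
        if lo == hi:
            # With an odd number of participants, one advances unpaired.
            next_active[order[lo]] = active[order[lo]]
        active = next_active
        rounds += 1
    (common,) = active.values()
    return common, rounds


# Hypothetical sample IDs held by four participants.
parties = {
    "A": {1, 2, 3, 4, 5},
    "B": {2, 3, 4, 5, 6, 7},
    "C": {3, 4, 5},
    "D": {0, 3, 4, 5, 8, 9, 10},
}
common_ids, num_rounds = tree_intersect(parties)
```

With four participants, the sketch finishes in two rounds, whereas intersecting the sets one after another in a chain would need three sequential steps. In general, halving the number of active participants each round gives about $\lceil \log_2 n \rceil$ rounds for $n$ participants.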
Abstract
Vertical federated learning (VFL) considers the setting in which the features of data samples are partitioned across different participants. VFL consists of two main steps: identifying the data samples common to all participants (alignment) and training a model on the aligned samples (training). However, when there are many participants and data samples, both alignment and training become slow. We therefore propose TreeCSS, an efficient VFL framework that accelerates both steps. For sample alignment, we design an efficient multi-party private set intersection (MPSI) protocol called Tree-MPSI, which adopts a tree-based structure and a data-volume-aware scheduling strategy to parallelize alignment among the participants. Because model training time scales with the number of data samples, we conduct coreset selection (CSS) to choose a subset of representative samples for training. Our CSS method adopts a clustering-based scheme for security and generality: it first clusters the features locally on each participant and then merges the local clustering results to select representative samples. In addition, we weight the samples according to their distances to the centroids to reflect their importance to model training. We evaluate the effectiveness and efficiency of TreeCSS on various datasets and models. The results show that, compared with vanilla VFL, TreeCSS accelerates training by up to $2.93\times$ and achieves comparable model accuracy.
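The clustering-based coreset selection described above can be sketched as follows. This is an illustrative approximation under several assumptions, not the authors' implementation: cluster tuples are handled in plaintext rather than encrypted, one representative is kept per distinct cluster tuple, and the weight formula `1 / (1 + total distance)` is a placeholder chosen only to show distance-based weighting. The function names, the choice of `k`, and the toy data are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def assign(points: List[Point], centroids: List[List[float]]) -> List[int]:
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def kmeans(points: List[Point], k: int, iters: int = 20) -> Tuple[List[int], List[float]]:
    """Tiny Lloyd's K-Means; assumes at least k points.

    Returns each point's cluster id and its distance to that centroid.
    """
    centroids = [list(p) for p in points[:k]]  # deterministic init: first k points
    for _ in range(iters):
        labels = assign(points, centroids)
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = assign(points, centroids)
    dists = [math.dist(p, centroids[lab]) for p, lab in zip(points, labels)]
    return labels, dists


def select_coreset(views: List[List[Point]], k: int) -> Dict[int, float]:
    """views[m][i] is participant m's feature slice of aligned sample i.

    Returns {sample index: weight} for the selected representatives.
    """
    per_party = [kmeans(view, k) for view in views]  # local clustering per participant
    groups = defaultdict(list)
    for i in range(len(views[0])):
        cluster_tuple = tuple(labels[i] for labels, _ in per_party)
        total_dist = sum(dists[i] for _, dists in per_party)
        groups[cluster_tuple].append((total_dist, i))
    coreset: Dict[int, float] = {}
    for members in groups.values():
        total_dist, i = min(members)  # assumed rule: keep the sample nearest its centroids
        coreset[i] = 1.0 / (1.0 + total_dist)  # assumed, illustrative weight
    return coreset


# Toy data: six aligned samples, features split across two participants.
party_a = [[0.0], [10.0], [0.2], [10.2], [0.1], [9.9]]
party_b = [[0.0, 0.0], [5.0, 5.0], [5.1, 5.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.2]]
coreset = select_coreset([party_a, party_b], k=2)
```

On this toy input, the six samples fall into four distinct cluster tuples, so the sketch keeps four weighted representatives. Samples that share a cluster on every participant are treated as redundant, which is what shrinks the training set.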
