ICAFS: Inter-Client-Aware Feature Selection for Vertical Federated Learning

Ruochen Jin, Boning Tong, Shu Yang, Bojian Hou, Li Shen

TL;DR

ICAFS tackles the problem of inter-client feature selection in vertical federated learning without sharing private gradients. It introduces a three-stage framework that (i) generates label-aware synthetic data via a federated Wasserstein GAN, (ii) learns multiple gate-based embedding selectors on synthetic data and ensembles them, and (iii) applies the learned selectors to real data to refine predictions. The approach yields superior accuracy across multiple real-world datasets (including ADNI, USPS, ALLAML, and TOX_171), demonstrates scalability to more clients, and exhibits robustness to noise and privacy constraints. By decoupling feature selection from private gradients and leveraging inter-client feature interactions, ICAFS offers a practical, neural-network–friendly solution for privacy-preserving VFL FS.

Abstract

Vertical federated learning (VFL) lets clients that hold vertically partitioned data collaboratively train machine learning models. Feature selection (FS) plays a crucial role in VFL because the features are distributed across multiple clients. In VFL, different clients possess distinct subsets of features for overlapping data samples, making the identification and selection of the most relevant features a complex yet essential task. Previous FS efforts have primarily focused on intra-client feature selection, overlooking vital feature interactions across clients and leading to subpar model outcomes. We introduce ICAFS, a novel multi-stage ensemble approach for effective FS in VFL that considers inter-client interactions. By employing conditional feature synthesis alongside multiple learnable feature selectors, ICAFS performs ensemble FS over these selectors using synthetic embeddings. This method bypasses the limitations of private gradient sharing and allows for model training using real data with refined embeddings. Experiments on multiple real-world datasets demonstrate that ICAFS surpasses current state-of-the-art methods in prediction accuracy.

Paper Structure

This paper contains 24 sections, 1 equation, 3 figures, 8 tables, 2 algorithms.

Figures (3)

  • Figure 1: VFL feature selection architecture comparison between Intra-Client FS and Inter-Client FS (ours). Significant embedding components (in gray) are selected. In the left panel, each client selects their own features. In the right panel, the selection is processed after concatenating embeddings from clients.
  • Figure 2: Illustration of the ICAFS pipeline. There are three stages: (1) synthetic data generation, (2) embedding component selection with synthetic data, and (3) classification on real data. Significant embedding components (emphasized in gray, with $\bm\alpha>0$) are selected after training. The server-side and client-side models are shown at the top and bottom, respectively. Synthetic data generated in Stage 1 is used in Stage 2.
  • Figure 3: (a-b) Distribution of both original and synthetic data. (c-d) Ablation study on $N$ and $\beta$. (e) Training time comparison.
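
The inter-client selection described in the captions of Figures 1 and 2 (concatenate the client embeddings, then keep the components whose gate value $\bm\alpha$ is positive) can be illustrated with a small NumPy sketch. This is only a toy illustration under stated assumptions, not the authors' implementation: the function names, the hard-coded gate values, and in particular the majority-vote rule used to combine several selectors are hypothetical, since this page does not specify how the ensemble is formed or how the gates are trained.

```python
import numpy as np

def concat_client_embeddings(client_embeddings):
    """Concatenate per-client embedding matrices (same samples, different
    components) along the feature axis, as in the inter-client setting."""
    return np.concatenate(client_embeddings, axis=1)

def ensemble_gate_mask(gates, min_votes=None):
    """Combine several gate vectors into one boolean selection mask.

    gates: array of shape (num_selectors, dim) holding gate values alpha.
    A selector "selects" a component when its alpha is > 0.
    ASSUMPTION: the ensemble keeps a component when at least `min_votes`
    selectors select it (strict majority by default). The actual ICAFS
    ensemble rule may differ.
    """
    gates = np.asarray(gates, dtype=float)
    votes = (gates > 0).sum(axis=0)
    if min_votes is None:
        min_votes = gates.shape[0] // 2 + 1
    return votes >= min_votes

def refine(embeddings, mask):
    """Zero out embedding components that were not selected."""
    return embeddings * mask

# Toy data: 3 aligned samples, client A holds 2 components, client B holds 3.
emb_a = np.ones((3, 2))
emb_b = np.full((3, 3), 2.0)
joint = concat_client_embeddings([emb_a, emb_b])  # shape (3, 5)

# Hypothetical gate values from three selectors (rows) over 5 components.
gates = [
    [0.9, 0.0, 0.3, 0.0, 0.5],
    [0.7, 0.2, 0.0, 0.0, 0.4],
    [0.8, 0.0, 0.6, 0.0, 0.0],
]
mask = ensemble_gate_mask(gates)
refined = refine(joint, mask)
```

Because the mask is computed over the concatenated embedding, a component from one client can be kept or dropped depending on what the other clients contribute, which is the distinction Figure 1 draws between intra-client and inter-client selection.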