Reliable Imputed-Sample Assisted Vertical Federated Learning

Yaopei Zeng, Lei Liu, Shaoguo Liu, Hongjian Dou, Baoyuan Wu, Li Liu

TL;DR

The paper addresses the challenge of limited overlapping samples in vertical federated learning (VFL) by proposing RISA, a two-stage framework that imputes non-overlapping samples using mean imputation and assigns pseudo-labels via self-training. It then employs evidential deep learning with a Reduced Yager's Rule to quantify and fuse uncertainty across parties, so that only reliable imputed samples contribute to training. The authors report significant gains on CIFAR-10 and Criteo, including an accuracy gain of $48\%$ on CIFAR-10 when only $1\%$ of the samples overlap, which highlights the practical value of leveraging non-overlapping data under privacy constraints. The work provides a principled mechanism to mitigate imputation noise and improve VFL performance in data-scarce settings.

Abstract

Vertical Federated Learning (VFL) is a well-known FL variant that enables multiple parties to collaboratively train a model without sharing their raw data. Existing VFL approaches focus on overlapping samples among different parties, so their performance is constrained by the limited number of these samples, and numerous non-overlapping samples are left unexplored. Some previous work has explored techniques for imputing missing values in samples, but often without adequate attention to the quality of the imputed samples. To address this issue, we propose a Reliable Imputed-Sample Assisted (RISA) VFL framework that effectively exploits non-overlapping samples by selecting reliable imputed samples for training VFL models. Specifically, after imputing non-overlapping samples, we introduce evidence theory to estimate the uncertainty of the imputed samples, and only samples with low uncertainty are selected. In this way, high-quality non-overlapping samples are utilized to improve the VFL model. Experiments on two widely used datasets demonstrate the significant performance gains achieved by RISA, especially when overlapping samples are limited, e.g., a 48% accuracy gain on CIFAR-10 with only 1% overlapping samples.
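The impute-then-select idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes missing attributes are marked as `None`, that each imputed sample already has a non-negative per-class evidence vector, and that uncertainty follows the common evidential deep learning form $u = K / S$ with $S = \sum_k (e_k + 1)$. The function names and the threshold value are hypothetical.

```python
from typing import List, Optional, Sequence


def mean_impute(rows: List[List[Optional[float]]]) -> List[List[float]]:
    """Replace each missing value (None) with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Assumption: a column with no observed values falls back to 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def dirichlet_uncertainty(evidence: Sequence[float]) -> float:
    """Uncertainty mass u = K / S, where S is the sum of (evidence + 1) over K classes."""
    k = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    return k / strength


def select_reliable(evidences: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Return indices of samples whose uncertainty is below the (hypothetical) threshold."""
    return [i for i, ev in enumerate(evidences) if dirichlet_uncertainty(ev) < threshold]


if __name__ == "__main__":
    features = [[1.0, None], [3.0, 4.0], [None, 8.0]]
    print(mean_impute(features))
    # Toy evidence vectors: the first sample has strong evidence for one class,
    # the second has almost no evidence and is therefore highly uncertain.
    evidences = [[9.0, 0.0, 0.0], [0.2, 0.1, 0.0]]
    print(select_reliable(evidences, threshold=0.5))
```

In this toy run the second sample carries almost no evidence, so its uncertainty is close to 1 and it is filtered out. How the threshold is chosen in RISA is not stated in this summary.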

Paper Structure (11 sections, 8 equations, 2 figures, 3 tables)


Figures (2)

  • Figure 1: Illustration of the virtual dataset in a VFL setting with $M$ parties. Each party holds a vertical partition of this dataset. The party that holds the labels is the active party; the other parties are passive. Sample 2 is shared among all parties and is therefore an overlapping sample. Samples 1 and 3 are non-overlapping because some of their attributes and labels are missing. Missing attributes and labels are depicted as hollow rectangles.
  • Figure 2: The pipeline of the proposed RISA framework, which consists of two stages. In Stage 1, RISA uses mean imputation to fill in missing attributes and a self-training strategy to assign pseudo-labels to non-overlapping samples in the passive party (i.e., Party 2). In Stage 2, RISA projects the features into evidence vectors with MLPs. A subjective opinion is formed for each party based on Dempster-Shafer theory (DST), and the opinions are fused using the combination rule described in Definition 1 to produce the final prediction.
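
The Stage 2 step in Figure 2, forming a subjective opinion per party and fusing the opinions, can be sketched as follows. The paper's Reduced Yager's Rule (Definition 1) is not reproduced in this summary, so the sketch substitutes the classical Yager's rule restricted to singleton classes plus an uncertainty mass, in which conflicting belief is moved into uncertainty instead of being normalized away. The names `opinion_from_evidence` and `yager_combine` are hypothetical.

```python
from typing import List, Sequence, Tuple

# An opinion is (belief mass per class, uncertainty mass); the masses sum to 1.
Opinion = Tuple[List[float], float]


def opinion_from_evidence(evidence: Sequence[float]) -> Opinion:
    """Map non-negative evidence to beliefs b_k = e_k / S and uncertainty u = K / S."""
    k = len(evidence)
    strength = sum(evidence) + k  # S = sum_k (e_k + 1)
    return [e / strength for e in evidence], k / strength


def yager_combine(a: Opinion, b: Opinion) -> Opinion:
    """Classical Yager's rule for two opinions (assumption: not the paper's reduced variant)."""
    (ba, ua), (bb, ub) = a, b
    # Belief in class k: both parties agree on k, or one supports k while the other is uncertain.
    fused = [x * y + x * ub + y * ua for x, y in zip(ba, bb)]
    # Conflict: the two parties commit belief to different classes.
    conflict = sum(ba) * sum(bb) - sum(x * y for x, y in zip(ba, bb))
    # Yager's rule assigns the conflicting mass to uncertainty.
    return fused, ua * ub + conflict


if __name__ == "__main__":
    party1 = opinion_from_evidence([8.0, 0.0])  # strong evidence for class 0
    party2 = opinion_from_evidence([0.0, 8.0])  # strong evidence for class 1
    print(yager_combine(party1, party1))  # agreement: low fused uncertainty
    print(yager_combine(party1, party2))  # disagreement: high fused uncertainty
```

Under this rule, two parties that agree produce a fused opinion with lower uncertainty than either input, while two parties that disagree produce a high-uncertainty opinion. That behavior is what makes the fused uncertainty usable as a reliability signal for imputed samples.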

Theorems & Definitions (1)

  • Definition 1