Reliable Imputed-Sample Assisted Vertical Federated Learning
Yaopei Zeng, Lei Liu, Shaoguo Liu, Hongjian Dou, Baoyuan Wu, Li Liu
TL;DR
The paper addresses the challenge of limited overlapping samples in vertical federated learning (VFL) by proposing RISA, a two-stage framework that imputes non-overlapping samples using mean imputation and assigns pseudo-labels via self-training. It then employs evidential deep learning with a Reduced Yager's Rule to quantify and fuse uncertainty across parties, enabling reliable, uncertainty-aware training. The authors report significant gains on CIFAR-10 and Criteo, including a 48% accuracy improvement on CIFAR-10 when only 1% of samples overlap, which highlights the practical value of leveraging non-overlapping data under privacy constraints. The work provides a principled mechanism to mitigate imputation noise and improve VFL performance in data-scarce settings, with potential impact on privacy-preserving collaborative learning.
Abstract
Vertical Federated Learning (VFL) is a well-known federated learning (FL) variant that enables multiple parties to collaboratively train a model without sharing their raw data. Existing VFL approaches focus on the samples that overlap among parties, so their performance is constrained by the limited number of such samples, and numerous non-overlapping samples are left unexplored. Some previous work has explored techniques for imputing missing values in samples, but often without adequate attention to the quality of the imputed samples. To address this issue, we propose a Reliable Imputed-Sample Assisted (RISA) VFL framework that effectively exploits non-overlapping samples by selecting reliable imputed samples for training VFL models. Specifically, after imputing non-overlapping samples, we introduce evidence theory to estimate the uncertainty of the imputed samples, and only samples with low uncertainty are selected. In this way, high-quality non-overlapping samples are used to improve the VFL model. Experiments on two widely used datasets demonstrate the significant performance gains achieved by RISA, especially when overlapping samples are limited, e.g., a 48% accuracy gain on CIFAR-10 with only 1% overlapping samples.
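The selection step described above can be illustrated with a minimal sketch. It assumes the standard evidential deep learning formulation from subjective logic, in which a model outputs non-negative evidence e_k for each of K classes, the Dirichlet strength is S = sum_k (e_k + 1), the belief masses are b_k = e_k / S, and the uncertainty mass is u = K / S. The function names, the 0.5 threshold, and the single-party setting are illustrative assumptions; the abstract does not specify them, and the cross-party fusion rule is not reproduced here.

```python
from typing import List, Sequence, Tuple


def edl_uncertainty(evidence: Sequence[float]) -> Tuple[List[float], float]:
    """Return (belief masses, uncertainty mass) for one sample.

    Assumes the common subjective-logic mapping: alpha_k = e_k + 1,
    S = sum(alpha), b_k = e_k / S, u = K / S, so sum(b) + u == 1.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)  # Dirichlet strength S
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


def select_reliable(batch: Sequence[Sequence[float]],
                    threshold: float = 0.5) -> List[int]:
    """Indices of samples whose uncertainty is strictly below the threshold.

    The threshold value is a hypothetical choice for illustration only.
    """
    return [i for i, ev in enumerate(batch)
            if edl_uncertainty(ev)[1] < threshold]


# Hypothetical evidence vectors for three imputed samples (K = 3 classes).
imputed_evidence = [
    [0.0, 0.0, 0.0],  # no evidence at all: u = 1.0
    [9.0, 0.0, 0.0],  # strong evidence for class 0: u = 0.25
    [1.0, 1.0, 1.0],  # weak, evenly spread evidence: u = 0.5
]
reliable = select_reliable(imputed_evidence)
```

In this sketch only the second sample would be kept, since it is the only one whose uncertainty mass falls below the assumed threshold.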
