Entity Augmentation for Efficient Classification of Vertically Partitioned Data with Limited Overlap

Avi Amalanshu, Viswesh Nagaswamy, G. V. S. S. Prudhvi, Yash Sirvi, Debashish Chakravarty

TL;DR

This paper tackles the inefficiency and privacy concerns of entity alignment in Vertical Federated Learning (VFL) by introducing Entity Augmentation, a method that synthesizes labels for the activations received by the host without requiring private set intersection (PSI). The host combines activations that may originate from different entities and assigns the result an interpolated label, a weighted average of the labels of the contributing entities. This enables training on misaligned or partially overlapping data, improving data utilization and convergence. Empirical results across six real-world datasets show that Entity Augmentation can match or exceed the performance of traditional alignment-based VFL, with notable gains when overlap is limited (e.g., 69.48% vs 48.1% test accuracy on CIFAR-10 with 5% overlap) and slight improvements even under full overlap, attributed to a regularizing effect. The work provides a practical, PSI-free alternative for VFL deployment and outlines future extensions to regression tasks and more sophisticated augmentation strategies.

Abstract

Vertical Federated Learning (VFL) is a machine learning paradigm for learning from vertically partitioned data (i.e., features for each input are distributed across multiple "guest" clients and an aggregating "host" server owns labels) without communicating raw data. Traditionally, VFL involves an "entity resolution" phase where the host identifies and serializes the unique entities known to all guests. This is followed by private set intersection to find common entities, and an "entity alignment" step to ensure all guests are always processing the same entity's data. However, using only data of entities from the intersection means guests discard potentially useful data. Moreover, the effect of these operations on privacy is dubious, and they are computationally expensive. We propose a novel approach that eliminates the need for set intersection and entity alignment in categorical tasks. Our Entity Augmentation technique generates meaningful labels for activations sent to the host, regardless of their originating entity, enabling efficient VFL without explicit entity alignment. With limited overlap between training data, this approach performs substantially better (e.g., with 5% overlap, test accuracy on CIFAR-10 improves from 48.1% to 69.48%). In fact, thanks to the regularizing effect, our model performs marginally better even with 100% overlap.

Paper Structure

This paper contains 25 sections, 1 equation, 2 figures, 4 tables, and 1 algorithm.

Figures (2)

  • Figure 1: Example forward pass with entity augmentation. Both clients forward activations from arbitrary inputs to the host, which is aware of the identity of said inputs. Half the features in the host input correspond to the number 1 and the other half correspond to 0. The interpolated label is their weighted average.
  • Figure 2: Training Curves for CIFAR, MNIST, Handwritten, Caltech-7, Parkinson's, and Credit Card datasets. Significant convergence improvements are observed with our method on CIFAR and MNIST. The efficacy extends to datasets like Handwritten, Caltech-7, Credit Card, and Parkinson's.
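
The label interpolation described in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: the function name `interpolate_label`, the use of one-hot labels, and the choice to weight each entity by the number of features it contributes are assumptions made here for illustration, based only on the caption's statement that the interpolated label is a weighted average.

```python
import numpy as np

def interpolate_label(entity_labels, feature_counts, num_classes):
    """Weighted average of one-hot labels (hypothetical helper).

    entity_labels:  class index of the entity behind each guest's activations
    feature_counts: number of host-input features contributed by each guest
                    (assumed here to determine the weights)
    """
    weights = np.asarray(feature_counts, dtype=float)
    weights = weights / weights.sum()          # normalize to sum to 1
    one_hot = np.eye(num_classes)[np.asarray(entity_labels)]
    return weights @ one_hot                   # soft label over classes

# Scenario from the Figure 1 caption: half the host input comes from an
# entity labeled 1, the other half from an entity labeled 0.
# The feature counts (8 and 8) are illustrative values only.
soft_label = interpolate_label([1, 0], [8, 8], num_classes=10)
print(soft_label)
```

In this scenario the soft label places equal mass on classes 0 and 1 and none elsewhere, so a host trained with a cross-entropy loss against it would be pushed to predict both classes with equal probability for that mixed input.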