Contrastive Federated Learning with Tabular Data Silos

Achmad Ginanjar, Xue Li, Wen Hua, Jiaming Pei

TL;DR

This paper tackles learning from vertically partitioned tabular data silos under strict privacy constraints by introducing Contrastive Federated Learning with Tabular Data Silos (CFL). CFL combines local contrastive learning with federated aggregation. It uses pre-processing steps such as zero-fill and Pearson reordering, together with a dot-product-based contrastive loss, to handle sample misalignment without sharing raw data. Encoder and decoder parameters are aggregated via FedAvg, producing silo-specific encoders whose performance is close to, or better than, that of models trained on global data across various data-imbalance scenarios. Empirical results on six datasets demonstrate CFL’s robustness, improved recall, and privacy-preserving advantages, with additional gains from integrating LightGBM and from the Pearson reordering technique, which makes it practical for privacy-sensitive, cross-silo tabular learning.
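
The FedAvg aggregation step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the parameter names (`encoder.w`, `decoder.w`), the flat-list parameter layout, and the weighting by local sample count (the standard FedAvg form) are assumptions, since the summary does not say how CFL weights the silos.

```python
from typing import Dict, List, Sequence

# Assumed layout: each client holds named parameter vectors as flat lists.
Params = Dict[str, List[float]]


def fedavg(client_params: Sequence[Params], client_sizes: Sequence[int]) -> Params:
    """Average client parameters, weighting each client by its sample count."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need at least one client and one size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Params = {}
    for name in client_params[0]:
        acc = [0.0] * len(client_params[0][name])
        for params, n in zip(client_params, client_sizes):
            weight = n / total
            for i, value in enumerate(params[name]):
                acc[i] += weight * value
        averaged[name] = acc
    return averaged


# Hypothetical encoder/decoder weights from two silos (illustrative values).
silo_a = {"encoder.w": [1.0, 2.0], "decoder.w": [0.0]}
silo_b = {"encoder.w": [3.0, 4.0], "decoder.w": [4.0]}
global_params = fedavg([silo_a, silo_b], client_sizes=[100, 300])
```

Only parameters cross the silo boundary in this sketch; the raw rows that produced them stay local, which is the property the TL;DR relies on.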

Abstract

Learning from vertically partitioned data silos is challenging due to the segmented nature of the data, sample misalignment, and strict privacy concerns. Federated learning has been proposed as a solution. However, sample misalignment across silos often hinders model performance and encourages sharing data within the model, which breaks privacy. We propose Contrastive Federated Learning with Tabular Data Silos (CFL), which handles data silos with sample misalignment without sharing original or representative data, thereby maintaining privacy. CFL begins by learning contrastive representations of the data locally within each silo and then aggregates knowledge from other silos through a federated learning algorithm. Our experiments demonstrate that CFL addresses the limitations of existing algorithms for data silos and outperforms existing tabular contrastive learning methods. CFL improves performance without loosening privacy.

Paper Structure (45 sections, 17 equations, 14 figures, 9 tables, 2 algorithms)

Figures (14)

  • Figure 1: Contrastive Federated Learning with Tabular Data Silos. The (A) areas are local contrastive learning, the (B) area is server learning, and the (C) area is the objects involved in federated learning. $[x]$ is the original data matrix and $[x']$ is the output matrix for supervised inferences.
  • Figure 2: CFL leverages the power of contrastive learning (CL) to find similarities between two data slices and federated learning (FL) to share knowledge between silos. Part (b.6) shows where the CFL problem begins. (b.6) is similar to (c.5), while (a.5) is similar to (b.5). The data in (b.5) are a slice, similar to that in (c.4). The representation in (b.4) is a full tuple representation because it came from (b.5), which is a slice of (b.6). Slices (c.4) and (b.5) have different column names/features. $A(.)$ is the evaluation function, $g(.)$ is the global model function, and $f(.)$ is the local model function.
  • Figure 3: Our Pearson ordering process. The original data are ordered by their Pearson correlation values to obtain a semantic representation useful for contrastive learning, that is, a horizontal semantic relationship.
  • Figure 4: {a,b,c,d,e,f} are the column names of the tabular data, (#1) is the first representation, (#2) is the second representation, and (#3) is the set of data targeted for the loss calculation. In (A), the representations are generated from a single record (#3) (single ID). In (B) and (C), the representations are generated from a set of records (#3) (several IDs). In (B), the representations are built from part of the data (#3) with some intersection (dark area in B), $\{\#1 \subseteq \#3,\#2 \subseteq \#3, \#1 \cap \#2\}$. In our CFL (C), each representation is a clone of the data (#3), $\{\#3 = \#1 = \#2\}$, that is, a full-row representation.
  • Figure 5: Figures of data imbalance across silos. A value of -1 indicates dropped data in a client due to class size imbalance, while a value of -2 indicates dropped data in a client due to data size imbalance. (- -) or ($c$) represents a client with a data size imbalance, (. .) or ($l$) represents a client with a class size imbalance. Both are introduced to represent sample misalignment due to label costliness and non-IID within the data silo. The (--) or ($n$) represents a client without problems.
  • ...and 9 more figures
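
The Pearson ordering idea in Figure 3 can be illustrated with a small sketch. The caption states only that the data are ordered by Pearson correlation value; choosing a reference column and sorting the remaining columns by the absolute value of their correlation with it are assumptions made here for illustration, and `pearson_order` and the example table are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def pearson_order(columns: Dict[str, List[float]], reference: str) -> List[str]:
    """Place the reference column first, then the others by descending |r|.

    Assumption: correlation is measured against one reference column.
    """
    others = [name for name in columns if name != reference]
    others.sort(
        key=lambda name: abs(pearson(columns[reference], columns[name])),
        reverse=True,
    )
    return [reference] + others


# Hypothetical four-column silo (illustrative values only).
table = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [4.0, 1.0, 3.0, 2.0],
    "c": [2.0, 4.0, 6.0, 8.0],
    "d": [5.0, 5.0, 5.0, 5.0],
}
ordered = pearson_order(table, reference="a")
```

In this example, column `c` is a scaled copy of `a`, so it moves next to `a`, while the constant column `d` ends up last. Strongly related features become neighbours in the row, which is the kind of horizontal relationship the caption describes.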