A Federated Learning Benchmark on Tabular Data: Comparing Tree-Based Models and Neural Networks

William Lindskog, Christian Prehofer

TL;DR

This work addresses the challenge of applying federated learning to tabular data by benchmarking three federated tree-based models and three DNNs across 10 public datasets under varied non-IID partitions. It uses horizontal FL with label, feature, and quantity skew to compare performance, highlighting that federated boosted trees, notably Federated XGBoost, consistently outperform federated neural networks. The study also shows that tree-based federated methods maintain advantages as the number of participating clients grows, and TBMs typically outperform parametric models. The findings suggest TBMs offer robust, scalable, and privacy-preserving benefits for tabular data in FL deployments, guiding model selection and future research on partition-robust and efficient FL strategies.

Abstract

Federated Learning (FL) has lately gained traction as it addresses how machine learning models train on distributed datasets. FL was designed for parametric models, namely Deep Neural Networks (DNNs). Thus, it has shown promise on image and text tasks. However, FL for tabular data has received little attention. Tree-Based Models (TBMs) are considered to perform better on tabular data, and they are starting to see FL integrations. In this study, we benchmark federated TBMs and DNNs for horizontal FL, with varying data partitions, on 10 well-known tabular datasets. Our novel benchmark results indicate that current federated boosted TBMs perform better than federated DNNs across different data partitions. Furthermore, a federated XGBoost outperforms all other models. Lastly, we find that federated TBMs perform better than federated parametric models, even when the number of clients is increased significantly.
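
The "varying data partitions" mentioned above can be illustrated with a minimal sketch of how label skew and quantity skew might be simulated when splitting one dataset across clients in horizontal FL. This is not the paper's partitioning code: the use of a Dirichlet distribution, the concentration parameter alpha, and the function names label_skew_partition and quantity_skew_partition are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def split_by_shares(idxs, shares):
    """Cut a list of indices into consecutive chunks proportional to shares."""
    chunks, start, cum = [], 0, 0.0
    for c, share in enumerate(shares):
        cum += share
        # The last client takes whatever remains so that no index is dropped.
        if c == len(shares) - 1:
            end = len(idxs)
        else:
            end = min(len(idxs), round(cum * len(idxs)))
        chunks.append(idxs[start:end])
        start = end
    return chunks


def label_skew_partition(labels, num_clients, alpha=0.5, seed=0):
    """Hypothetical helper: each class is spread unevenly across clients."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        shares = dirichlet(alpha, num_clients, rng)
        for c, chunk in enumerate(split_by_shares(idxs, shares)):
            clients[c].extend(chunk)
    return clients


def quantity_skew_partition(num_samples, num_clients, alpha=0.5, seed=0):
    """Hypothetical helper: clients receive different amounts of data."""
    rng = random.Random(seed)
    idxs = list(range(num_samples))
    rng.shuffle(idxs)
    return split_by_shares(idxs, dirichlet(alpha, num_clients, rng))


if __name__ == "__main__":
    toy_labels = [i % 3 for i in range(300)]  # toy data, three balanced classes
    for c, part in enumerate(label_skew_partition(toy_labels, num_clients=5)):
        counts = [sum(1 for i in part if toy_labels[i] == y) for y in range(3)]
        print(f"client {c}: {len(part)} rows, per-class counts {counts}")
```

In this sketch a smaller alpha gives more heterogeneous clients, while a large alpha approaches the homogeneous setting. Feature skew is not covered here.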

Paper Structure (17 sections, 6 figures, 6 tables)

This paper contains 17 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Tabular data. Each row is a unique observation and each column is a feature. Values can be numerical or categorical.
  • Figure 2: Architecture of XGBoost (wang2019hybrid). XGBoost combines the predictions of so-called "weak learners" to output a final prediction.
  • Figure 3: Models' test accuracy for 3, 5, 10, 15, 25 and 50 clients on the Heart Disease dataset in the homogeneous setting.
  • Figure 4: Models' test accuracy for 3, 5, 10, 15, 25 and 50 clients on the Adult dataset in the homogeneous setting.
  • Figure 5: Models' test accuracy for 3, 5, 10, 15, 25 and 50 clients on the FEMNIST dataset in the homogeneous setting.
  • ...and 1 more figures