Federated Learning for distribution skewed data using sample weights

Hung Nguyen, Peiyuan Wu, Morris Chang

TL;DR

This work tackles the non-IID feature-skew problem in federated learning by introducing FedDisk, a two-phase framework that uses MADE-based density estimation to implicitly learn global and local data distributions and derive sample weights for reweighting local losses. By estimating the density ratio through a binary classifier trained on MADE outputs, per-example weights adjust each client’s contribution to align with the global distribution, enabling faster convergence and reduced communication when training a shared classifier. The approach is validated on simulated MNIST and real non-IID FEMNIST and Chest-Xray datasets, showing superior accuracy and significantly lower communication cost compared to strong baselines, with a formal privacy leakage analysis indicating bounded information exposure that decreases with more clients. Overall, FedDisk offers a privacy-preserving, communication-efficient pathway to mitigate distribution skew in FL without sharing raw data, and opens avenues for further optimization of density models and robustness to attacks.
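The density-ratio step mentioned above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it assumes direct access to 1-D samples from both distributions (FedDisk instead trains its classifier on MADE outputs so that raw data is never shared), and the names `fit_logistic` and `density_ratio_weights` are hypothetical. The idea shown is class probability estimation: a binary classifier that separates "global" from "local" samples yields odds proportional to the density ratio.

```python
import numpy as np

def fit_logistic(x, y, lr=0.5, steps=2000):
    """Plain gradient-descent logistic regression on 1-D inputs (illustrative only)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))  # predicted P(y=1 | x)
        w -= lr * np.mean((p - y) * x)           # gradient of mean log-loss w.r.t. w
        b -= lr * np.mean(p - y)                 # gradient of mean log-loss w.r.t. b
    return w, b

def density_ratio_weights(x_local, x_global):
    """Estimate alpha(x) = p_global(x) / p_local(x) at the local samples.

    Assumption: both sample sets are visible in one place, which the
    paper avoids by exchanging MADE-based density information instead.
    """
    x = np.concatenate([x_global, x_local])
    # Label 1 = drawn from the global distribution, 0 = drawn from the local one.
    y = np.concatenate([np.ones(len(x_global)), np.zeros(len(x_local))])
    w, b = fit_logistic(x, y)
    odds = np.exp(w * x_local + b)               # P(y=1 | x) / P(y=0 | x)
    # Correct for the class prior so the odds become a density ratio.
    return odds * len(x_local) / len(x_global), (w, b)

# Toy data: global ~ N(0, 1), one skewed client ~ N(1, 1).
rng = np.random.default_rng(0)
x_global = rng.normal(0.0, 1.0, 5000)
x_local = rng.normal(1.0, 1.0, 5000)
alpha, (w, b) = density_ratio_weights(x_local, x_global)
```

For these two Gaussians the true log ratio is $0.5 - x$, so the fitted `(w, b)` should land near `(-1, 0.5)`, and local samples that look more "global" (smaller `x` here) receive larger weights.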

Abstract

One of the most challenging issues in federated learning is that the data are often not independent and identically distributed (non-IID). Clients are expected to contribute the same type of data, drawn from one global distribution. However, data are often collected in different ways from different sources. Thus, the data distributions among clients might differ from the underlying global distribution. This creates a weight divergence issue and reduces federated learning performance. This work focuses on improving federated learning performance for skewed data distributions across clients. The main idea is to adjust each client's distribution closer to the global distribution using sample weights, so that the machine learning model converges faster with higher accuracy. We start from the fundamental concept of empirical risk minimization and theoretically derive a solution for adjusting the distribution skewness using sample weights. To determine sample weights, we implicitly exchange density information by leveraging a neural network-based density estimation model, MADE. The clients' data distributions can then be adjusted without exposing their raw data. Our experimental results on three real-world datasets show that the proposed method not only improves federated learning accuracy but also significantly reduces communication costs compared to the other methods evaluated.
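The role of sample weights in empirical risk minimization can be sketched with the standard importance-weighting identity. The notation here is an assumption made for illustration and may differ from the paper's: $p_g$ is the global feature density, $p_k$ the density of client $k$, $\ell$ the loss, and $f_\theta$ the model. Assuming only the features are skewed (the conditional $p(y \mid x)$ is shared) and $p_k(x) > 0$ wherever $p_g(x) > 0$:

$$
\mathbb{E}_{x \sim p_g}\big[\ell(f_\theta(x), y)\big]
= \int p_k(x)\,\frac{p_g(x)}{p_k(x)}\,\ell(f_\theta(x), y)\,dx
= \mathbb{E}_{x \sim p_k}\big[\alpha_k(x)\,\ell(f_\theta(x), y)\big],
\qquad \alpha_k(x) = \frac{p_g(x)}{p_k(x)}.
$$

Under these assumptions, a client with samples $\{(x_i, y_i)\}_{i=1}^{n_k}$ would minimize the weighted empirical risk

$$
\hat{R}_k(\theta) = \frac{1}{n_k} \sum_{i=1}^{n_k} \alpha_k(x_i)\,\ell(f_\theta(x_i), y_i),
$$

which estimates the risk under the global distribution even though the samples come from the local one.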

Paper Structure

This paper contains 29 sections, 17 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: FedDisk framework. The proposed framework has two phases. In the first phase, the local and global probability density functions ($p(x)$, $q(x)$) are estimated via MADE models using FL procedures, and the sample weights $\alpha$ are then computed by approximating the density ratio via class probability estimation. In the second phase, the machine learning task (e.g., classification) is performed as in a typical FL method (e.g., FedAvg), using the sample weights acquired in phase 1.
  • Figure 2: Example images from the MNIST, FEMNIST, and Chest Xray datasets. They are collected from different sources and exhibit a variety of resolutions, styles, or conditions.
  • Figure 3: Average test accuracy of the global model during the aggregation process. For the MNIST dataset, noise with zero mean and a variance of 0.3 was added to the clients' data.
  • Figure 4: Test accuracy percentiles, min, max and median plot of 100 clients for different datasets and methods.
  • Figure 5: Average validation and training losses while training the global MADE models. Training was stopped when the validation loss started increasing.
  • ...and 2 more figures
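
The second phase described in the Figure 1 caption (FedAvg-style training with per-sample weights) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's code: the model is a plain logistic regression, the weights `alpha` are taken as given from phase 1, and the function names `weighted_local_update` and `fedavg_round` are hypothetical.

```python
import numpy as np

def weighted_local_update(theta, x, y, alpha, lr=0.1, epochs=5):
    """A few gradient steps on one client's sample-weighted logistic loss."""
    theta = theta.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ theta)))   # predicted P(y=1 | x)
        # Gradient of (1/n) * sum_i alpha_i * log-loss_i with respect to theta.
        grad = x.T @ (alpha * (p - y)) / len(y)
        theta -= lr * grad
    return theta

def fedavg_round(theta, clients, lr=0.1, epochs=5):
    """One communication round: local weighted updates, then size-weighted averaging."""
    updates = [weighted_local_update(theta, x, y, a, lr, epochs)
               for x, y, a in clients]
    sizes = np.array([len(y) for _, y, _ in clients], dtype=float)
    return np.average(updates, axis=0, weights=sizes)

# Toy setup: two clients, labels depend on the sign of the first feature.
rng = np.random.default_rng(0)
clients = []
for n in (200, 300):
    x = rng.normal(size=(n, 2))
    y = (x[:, 0] > 0).astype(float)
    alpha = np.ones(n)        # placeholder; phase 1 would supply real weights
    clients.append((x, y, alpha))

theta = np.zeros(2)
for _ in range(20):           # 20 communication rounds
    theta = fedavg_round(theta, clients)
```

With all weights set to one, this reduces to ordinary FedAvg on the unweighted loss; replacing `alpha` with estimated density ratios changes only the local gradient, not the server-side aggregation.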