The Key of Parameter Skew in Federated Learning

Junfeng Liao; Sifan Wang; Ye Yuan; Riquan Zhang

TL;DR

Federated Learning under non-IID data suffers from parameter skew in local models, which biases the global parameter estimation. The authors propose FedPake, a dispersion-aware aggregation that uses the coefficient of variation to separate high-dispersion from low-dispersion parameters and introduces Micro-Class and Macro-Class to weight high-dispersion updates during global aggregation. The global objective is $\text{Loss} = \frac{\sum_{i=1}^N |D^i| \cdot \mathbb{E}_{(X^i, Y^i) \sim D^i}[\mathcal{L}(f(X^i), Y^i)]}{\sum_{i=1}^N |D^i|}$. Empirical results on CIFAR-10/100 and Tiny-ImageNet show FedPake outperforms eight baselines by up to about 4.7 percentage points in test accuracy and converges faster with modest additional computation. This work provides a principled approach to mitigating parameter skew, improving generalization in heterogeneous FL settings and guiding future discrepancy-aware aggregation, with potential applicability to larger models.
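The global objective above is a dataset-size-weighted average of the per-client expected losses. A minimal sketch of that arithmetic follows; the function name `global_loss` and the example numbers are illustrative assumptions, not taken from the paper.

```python
def global_loss(local_losses, dataset_sizes):
    """Size-weighted average of per-client losses.

    local_losses[i]  : estimate of E[L(f(X^i), Y^i)] on client i (assumed given)
    dataset_sizes[i] : |D^i|, the number of samples held by client i
    """
    if len(local_losses) != len(dataset_sizes):
        raise ValueError("one loss and one dataset size are needed per client")
    total = sum(dataset_sizes)
    if total <= 0:
        raise ValueError("total dataset size must be positive")
    # Numerator: sum_i |D^i| * loss_i ; denominator: sum_i |D^i|
    weighted = sum(n * loss for n, loss in zip(dataset_sizes, local_losses))
    return weighted / total


# Hypothetical example: three clients with unequal dataset sizes.
example_loss = global_loss([0.5, 1.0, 2.0], [100, 300, 600])
```

Clients with more data contribute proportionally more to the objective, so in this made-up example the largest client dominates the result.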

Abstract

Federated Learning (FL) has emerged as an excellent solution for performing deep learning across different data owners without exchanging raw data. However, statistical heterogeneity in FL presents a key challenge: it skews the distributions of local model parameters, a phenomenon that researchers have largely overlooked. In this work, we propose the concept of parameter skew to describe this phenomenon, which can substantially affect the accuracy of global model parameter estimation. Additionally, we introduce FedSA, an aggregation strategy that addresses the implications of parameter skew to obtain a high-quality global model. Specifically, we categorize parameters into high-dispersion and low-dispersion groups based on the coefficient of variation. For high-dispersion parameters, Micro-Classes (MIC) and Macro-Classes (MAC) represent the dispersion at the micro and macro levels, respectively, forming the foundation of FedSA. To evaluate the effectiveness of FedSA, we conduct extensive experiments with different FL algorithms on three computer vision datasets. FedSA outperforms eight state-of-the-art baselines by about 4.7% in test accuracy.
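The split into high-dispersion and low-dispersion parameters can be illustrated with a short sketch. It computes, for each parameter position, the coefficient of variation (standard deviation divided by the absolute mean) across clients and compares it to a cutoff. The `threshold` argument, the use of the population standard deviation, and the small `eps` guard are assumptions made for illustration; the paper's exact criterion and its MIC/MAC weighting are not reproduced here.

```python
from statistics import mean, pstdev


def split_by_dispersion(client_params, threshold, eps=1e-12):
    """Partition parameter indices by coefficient of variation across clients.

    client_params : list of flat parameter lists, one per client, equal length
    threshold     : hypothetical CV cutoff (not a value from the paper)
    Returns (high_dispersion_indices, low_dispersion_indices).
    """
    high, low = [], []
    # zip(*client_params) yields, for each index j, that parameter's
    # value on every client.
    for j, values in enumerate(zip(*client_params)):
        mu = mean(values)
        cv = pstdev(values) / (abs(mu) + eps)  # eps avoids division by zero
        (high if cv > threshold else low).append(j)
    return high, low


# Hypothetical example: parameter 0 agrees across clients, parameter 1 does not.
clients = [[1.0, 0.1], [1.1, -0.5], [0.9, 2.0]]
high_idx, low_idx = split_by_dispersion(clients, threshold=0.5)
```

In this toy input, the first parameter varies little around its mean and lands in the low-dispersion group, while the second is scattered and lands in the high-dispersion group.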

Paper Structure (25 sections, 11 equations, 9 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Traditional Federated Learning
Personalized Federated Learning
Methodology
Problem Statement
Our Method
Experiments
Experiment Setting
Performance Comparison and Analysis
Model Analysis
Conclusion and Discussions
Conclusion
Discussions
Dataset
...and 10 more sections

Figures (9)

Figure 1: t-SNE visualizations of how the distribution of local model parameters changes during training under IID and non-IID settings. Under IID, parameters gradually converge during training, whereas under non-IID, they remain scattered. The experiments are conducted with ResNet-18 (He et al., 2016) on the CIFAR-10 dataset.
Figure 2: The distribution of parameters of local models, FedPake (ours), MOON, FedALA, and FedAvg. The parameters in the figure are from ResNet-18 trained with FedPake and other methods on CIFAR-10. The distribution of the local model parameters is skewed, which clearly illustrates the presence of parameter skew. Our method aligns closely with the main peak of the distribution, indicating that FedPake effectively captures the central tendency under parameter skew. In contrast, other methods fail to address this issue, resulting in a deviation from the main peak.
Figure 3: The architecture of FedPake. The local models are passed to Parameter Division to obtain high-dispersion and low-dispersion parameters. For high-dispersion parameters, the final values are computed as a weighted average, while plain average values serve as the final values for low-dispersion parameters. Specifically, for each high-dispersion parameter, our method computes a weight $\alpha$ based on Micro-Class and Macro-Class. A 3x3 convolutional kernel of the backbone is shown as an example.
Figure 4: The effect of each hyperparameter. On CIFAR-10/100, we show the training of FedPake with various values of the hyperparameters $\lambda$, $C$, and $S$; all other settings follow the default experiment setting. The red line marks the optimal hyperparameter setting.
Figure 5: The performance of FedPake (ours) and baselines on CIFAR-10. The top figure presents the training loss of seven FL methods, and the bottom shows their test accuracy. Experiments are conducted under default settings.
...and 4 more figures
