pMixFed: Efficient Personalized Federated Learning through Adaptive Layer-Wise Mixup

Yasaman Saadati, Mohammad Rostami, M. Hadi Amini

Abstract

Traditional Federated Learning (FL) methods encounter significant challenges when dealing with heterogeneous data and providing personalized solutions for non-IID scenarios. Personalized Federated Learning (PFL) approaches aim to address these issues by balancing generalization and personalization, often through parameter decoupling or partial models that freeze some neural network layers for personalization while aggregating other layers globally. However, existing methods still face challenges of global-local model discrepancy, client drift, and catastrophic forgetting, which degrade model accuracy. To overcome these limitations, we propose $\textit{pMixFed}$, a dynamic, layer-wise PFL approach that integrates $\textit{mixup}$ between shared global and personalized local models. Our method introduces an adaptive strategy for partitioning between personalized and shared layers, a gradual transition of personalization degree to enhance local client adaptation, improved generalization across clients, and a novel aggregation mechanism to mitigate catastrophic forgetting. Extensive experiments demonstrate that pMixFed outperforms state-of-the-art PFL methods, showing faster model training, increased robustness, and improved handling of data heterogeneity under different heterogeneous settings.
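
To make the layer-wise mixup idea concrete, here is a minimal sketch, not the authors' implementation. It assumes each layer is a flat list of weights, that a per-layer degree `lam` weights the global model and `1 - lam` the local one, and that the degrees fall linearly from the first layer to the last. The helper names `linear_layer_lambdas` and `layerwise_mixup`, the direction of the weighting, and the linear schedule are all illustrative assumptions; the abstract states only that mixup is applied layer-wise between the shared global and personalized local models.

```python
from typing import List

Layer = List[float]  # one layer's weights, flattened for illustration


def linear_layer_lambdas(num_layers: int, mu: float) -> List[float]:
    """Hypothetical schedule: lambda falls linearly from mu (first layer) to 0 (last)."""
    if num_layers < 1:
        raise ValueError("num_layers must be positive")
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    if num_layers == 1:
        return [mu]
    return [mu * (1.0 - i / (num_layers - 1)) for i in range(num_layers)]


def layerwise_mixup(
    global_layers: List[Layer],
    local_layers: List[Layer],
    lambdas: List[float],
) -> List[Layer]:
    """Per-layer convex combination: lam * global + (1 - lam) * local (assumed form)."""
    if not (len(global_layers) == len(local_layers) == len(lambdas)):
        raise ValueError("expected one lambda per layer in both models")
    mixed = []
    for g, l, lam in zip(global_layers, local_layers, lambdas):
        if len(g) != len(l):
            raise ValueError("global and local layers must have the same shape")
        mixed.append([lam * gw + (1.0 - lam) * lw for gw, lw in zip(g, l)])
    return mixed


if __name__ == "__main__":
    global_model = [[1.0, 1.0], [1.0, 1.0], [1.0]]
    local_model = [[0.0, 0.0], [0.0, 0.0], [0.0]]
    lambdas = linear_layer_lambdas(len(global_model), mu=1.0)
    # Early layers lean on the global model, the last layer stays fully local.
    print(layerwise_mixup(global_model, local_model, lambdas))
```

Under these assumptions, setting every degree to 1 or 0 recovers a fully shared or fully frozen layer, so a hard split between shared and personalized layers appears as a special case of the soft per-layer blend.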

Paper Structure (54 sections, 4 theorems, 40 equations, 11 figures, 8 tables, 2 algorithms)

Key Result

Theorem 1

For any round $t$, the pMixFed update (eq:pmix-agg) coincides with a FedSGD step of size $\eta_g$ if and only if

Figures (11)

  • Figure 1: Discrepancy between personalized and global shared layers in Partial PFL: (1) The global model, $G^t$, is constructed by aggregating asynchronous local updates from clients, denoted as $L^t_i$, $L^t_j$, and $L^t_k$. (2),(3) In communication round $t$, available clients $i$ and $j$ aggregate shared parameters to produce the updated global model $G^{t+1}$, while the personalized parameters, such as $L^t_k$, remain unchanged for unavailable clients. (4) This integration of distinct models, $G^{t+1}$ and $L^t_k$, induces inconsistencies in the overall model updates. (Bottom) During the joint training of generalized and personalized models, the gradient updates from the generalized layers are impacted by the gradients from personalized layers, resulting in catastrophic forgetting, performance drop and slower convergence rates.
  • Figure 2: Workflow of pMixFed: Mixup is used in two stages. (1) Broadcasting: when transferring knowledge to local models, the frozen personalized model $L_k^{(t)}$ is mixed with the global model $G^{(t)}$ according to the adaptive mix factor $\mu_k^{(t)}$, which determines the layer-wise mixup degree $\lambda_i$ for layer $i$. (2) Aggregation: the updated global model $G^{(t+1)}$ is generated by applying Mixup between the updated local model $L^{(t+1)}$ and the current state of the global model $G'^{(t)}$.
  • Figure 3: Average test accuracy over global communication rounds for pMixFed and other PFL baselines on CIFAR10 and CIFAR100, where $N=100$ and $C=10\%$. More details are discussed in Section \ref{analytic_experiment}.
  • Figure 4: (a) The accuracy drop in FedSim occurred due to the vanishing gradient at round 42. (b) Accuracy declines at round 10 in FedAlt due to the introduction of 5 new participants. Applying adaptive mixup solely between corresponding global and local shared layers mitigates the accuracy drop.
  • Figure 5: (a) Effect of the learning rate on the average (out-of-sample) test accuracy gap and on cold-start users. (b) Comparison of test accuracy on cold-start users under different mix-factor functions. (Dynamic-only) A fixed $\mu$ is used for all communication rounds. (Sigmoid) The original updating strategy, based on a sigmoid function. (Gradual) A simple linear function is adopted for updating $\mu$. (Random) The mixup degree $\lambda_i$ is sampled randomly from a Beta distribution.
  • ...and 6 more figures
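
The four mix-factor variants named in the Figure 5 caption (fixed, sigmoid, linear, random) can be sketched as simple functions of the communication round. This is a rough illustration under stated assumptions, not the paper's formulas: the function names, the `steepness` and `value` parameters, the Beta shape parameters, and the choice that the factor grows over rounds are all hypothetical.

```python
import math
import random


def mu_fixed(t: int, total_rounds: int, value: float = 0.5) -> float:
    """'Dynamic-only' variant: the same mix factor in every round (value is assumed)."""
    return value


def mu_sigmoid(t: int, total_rounds: int, steepness: float = 10.0) -> float:
    """Sigmoid variant: smooth transition centered at the midpoint of training (assumed)."""
    x = t / total_rounds - 0.5
    return 1.0 / (1.0 + math.exp(-steepness * x))


def mu_linear(t: int, total_rounds: int) -> float:
    """'Gradual' variant: linear in the round index, clipped to [0, 1]."""
    return min(1.0, max(0.0, t / total_rounds))


def lambda_random(rng: random.Random, alpha: float = 0.5, beta: float = 0.5) -> float:
    """'Random' variant: a per-layer degree drawn from a Beta distribution."""
    return rng.betavariate(alpha, beta)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so repeated runs are comparable
    for t in (0, 25, 50, 75, 100):
        print(t, mu_fixed(t, 100), round(mu_sigmoid(t, 100), 3), mu_linear(t, 100))
    print("random lambda:", lambda_random(rng))
```

The sigmoid and linear schedules agree at the start, midpoint, and end of training only approximately; the sigmoid stays near its extremes longer and changes fastest around the midpoint, which is one plausible reason to prefer it for a gradual shift in personalization degree.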

Theorems & Definitions (4)

  • Theorem 1: Coefficient matching with FedSGD
  • Lemma 2: One-step descent
  • Theorem 3: Nonconvex rate
  • Theorem 4: Strongly-convex case