FedMoE: Personalized Federated Learning via Heterogeneous Mixture of Experts
Hanzi Mei, Dongqi Cai, Ao Zhou, Shangguang Wang, Mengwei Xu
TL;DR
FedMoE tackles heterogeneous Federated Learning for Large Language Models by integrating a sparsely activated Mixture-of-Experts backbone into a two-stage fine-tuning pipeline. The method first identifies client-specific sub-MoEs through activation-based heuristics and then performs federated training with modular aggregation and expert-recommendation-driven adjustments to refine structure and sharing. Empirical results show FedMoE improves task performance while reducing memory footprint and communication compared to baselines, and exhibits robust convergence across complex cross-task scenarios. The approach enables practical FedLLMs on edge devices by dynamically scaling capacity and sharing only the most relevant experts, thereby balancing personalization with global knowledge transfer.
Abstract
As Large Language Models (LLMs) push the boundaries of AI capabilities, their demand for data is growing. Much of this data is private and distributed across edge devices, making Federated Learning (FL) a de facto alternative for fine-tuning (i.e., FedLLM). However, FedLLM faces significant challenges due to the inherent heterogeneity among clients, including varying data distributions and diverse task types. Towards a versatile FedLLM, we replace the traditional dense model with a sparsely activated Mixture-of-Experts (MoE) architecture, whose parallel feed-forward networks enable greater flexibility. To make it more practical in resource-constrained environments, we present FedMoE, an efficient personalized FL framework that addresses data heterogeneity by constructing an optimal sub-MoE for each client and bringing the knowledge back to the global MoE. FedMoE is composed of two fine-tuning stages. In the first stage, FedMoE simplifies the problem by conducting a heuristic search based on observed activation patterns, which identifies a suboptimal submodel for each client. In the second stage, these submodels are distributed to clients for further training and returned to the server for aggregation through a novel modular aggregation strategy. Meanwhile, FedMoE progressively adjusts the submodels toward the optimum through global expert recommendation. Experimental results demonstrate the superiority of our method over previous personalized FL methods.
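The two stages can be illustrated with a toy sketch. It is not the FedMoE implementation: the abstract does not specify the search heuristic or the aggregation rule, so the sketch assumes a top-k selection by activation count and an unweighted per-expert average over the clients that hold each expert. The names `select_experts`, `aggregate_modular`, and `budget` are hypothetical.

```python
def select_experts(activation_counts, budget):
    """Assumed stage-1 heuristic: keep the `budget` most frequently activated experts.

    activation_counts: dict mapping expert id -> number of activations on the client.
    Ties are broken by the smaller expert id so the result is deterministic.
    """
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))
    return sorted(ranked[:budget])


def aggregate_modular(global_experts, client_updates):
    """Assumed stage-2 rule: average each expert only over the clients that trained it.

    global_experts: dict mapping expert id -> list of floats (toy weights).
    client_updates: list of dicts with the same layout, one per client,
                    each containing only that client's sub-MoE experts.
    """
    new_global = {}
    for expert_id, weights in global_experts.items():
        holders = [u[expert_id] for u in client_updates if expert_id in u]
        if not holders:
            # No client held this expert this round: keep the old weights.
            new_global[expert_id] = list(weights)
            continue
        new_global[expert_id] = [sum(vals) / len(holders) for vals in zip(*holders)]
    return new_global


# Toy example: 4 global experts, 2 clients, each limited to 2 experts.
counts_a = {0: 40, 1: 3, 2: 25, 3: 1}
counts_b = {0: 2, 1: 30, 2: 28, 3: 0}
sub_a = select_experts(counts_a, budget=2)
sub_b = select_experts(counts_b, budget=2)

global_experts = {e: [0.0, 0.0] for e in range(4)}
update_a = {0: [1.0, 1.0], 2: [2.0, 0.0]}  # client A trained experts 0 and 2
update_b = {1: [3.0, 3.0], 2: [4.0, 2.0]}  # client B trained experts 1 and 2
new_global = aggregate_modular(global_experts, [update_a, update_b])
```

In this toy run, expert 2 is shared by both clients and is averaged, experts 0 and 1 each take their single holder's update, and expert 3 is left unchanged. A real system would likely weight clients (for example by data size) and operate on tensors rather than lists.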
