FLEx: Personalized Federated Learning for Mixture-of-Experts LLMs via Expert Grafting

Fan Liu, Bikang Pan, Zhongyi Wang, Xi Yao, Xiaoying Tang, Jingya Wang, Ye Shi

TL;DR

FLEx introduces Federated LLMs with Personalized Experts for MoE-based models, decoupling shared dense parameters from frozen pretrained experts to reduce communication and prevent forgetting. Personalization is achieved by grafting a client-specific lightweight expert per MoE layer, selected via local reconstruction loss, and integrated with an adaptive gating mechanism that dynamically balances shared and personalized knowledge. Extensive experiments on non-IID instruction-tuning tasks show FLEx outperforms standard federated baselines and preserves world knowledge, achieving high ROUGE-L performance and strong MMLU scores, while remaining compatible with PEFT methods like LoRA. The framework demonstrates robust performance across pathological and Dirichlet non-IID regimes and domain-specific data, highlighting practical gains for privacy-preserving, personalized LLM deployment.
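The selection step mentioned above ("selected via local reconstruction loss") can be illustrated with a toy sketch. This is not the authors' implementation. It assumes, purely for illustration, that each candidate expert is scored by the mean squared error between its own output and the full MoE layer output on local inputs, and that the lowest-error expert is the one grafted. The names `select_expert_to_graft`, `make_scaler`, and `layer_output` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Fn = Callable[[Vector], Vector]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def select_expert_to_graft(
    experts: Sequence[Fn],
    layer_output: Fn,
    local_inputs: Sequence[Vector],
) -> Tuple[int, List[float]]:
    """Score every candidate expert on local data and return the best index.

    Assumption: the "reconstruction loss" is the average MSE between one
    expert's output and the full layer output over the client's inputs.
    """
    losses = []
    for expert in experts:
        total = sum(mse(expert(x), layer_output(x)) for x in local_inputs)
        losses.append(total / len(local_inputs))
    best = min(range(len(experts)), key=losses.__getitem__)
    return best, losses


def make_scaler(scale: float) -> Fn:
    """Toy stand-in for an expert: multiplies every element by `scale`."""
    return lambda x: [scale * v for v in x]


# Toy setup: three candidate experts and a layer that doubles its input.
experts = [make_scaler(0.5), make_scaler(1.9), make_scaler(3.0)]
layer_output = make_scaler(2.0)
local_inputs = [[1.0, 2.0], [0.5, -1.0]]

best, losses = select_expert_to_graft(experts, layer_output, local_inputs)
print("selected expert:", best)
```

In this toy setup the second expert (index 1) is selected because its scale is closest to the layer's behavior on the local inputs. A real system would operate on tensors and on hidden states collected from the client's data.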

Abstract

Federated instruction tuning of large language models (LLMs) is challenged by significant data heterogeneity across clients, demanding robust personalization. The Mixture of Experts (MoE) architecture, where experts can specialize in distinct data patterns, presents a natural architectural solution to this challenge. However, the inherent sparsity of the MoE architecture, achieved by selectively activating experts, poses a significant challenge to its integration with federated learning (FL). Conventional FL frameworks, designed for dense models, naively aggregate all expert parameters irrespective of their local activation patterns. This naive approach not only undermines MoE's dynamic sparsity but also risks corrupting the world knowledge within pretrained experts. To address this, we propose FLEx (Federated LLMs with Personalized Experts), a novel framework that leverages pretrained MoE-based LLMs for efficient personalization. By aggregating only the shared non-expert parameters, FLEx significantly reduces communication overhead and preserves the world knowledge stored within the frozen pretrained experts. For personalization, we introduce a novel expert grafting mechanism that leverages dynamic sparsity to construct a client-specific expert from selected components of pretrained experts, tailored to local data. This grafted expert is then fine-tuned locally alongside the gating mechanism. This joint training enables the model to learn when to leverage the shared knowledge from frozen experts and when to employ the personalized one. Evaluations on diverse, non-IID instruction tuning datasets show that FLEx consistently outperforms federated baselines on average, while demonstrating strong knowledge preservation on the knowledge-driven benchmark MMLU. Our code is available at https://anonymous.4open.science/r/FLEx-8F12.
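The idea of aggregating only the shared non-expert parameters can be sketched as a filtered weighted average. This is a minimal illustration under stated assumptions, not the released FLEx code: parameters are plain lists of floats, expert parameters are identified by the hypothetical naming convention `".experts."` in the parameter name, and clients are weighted by local sample count as in standard FedAvg.

```python
from typing import Dict, List, Sequence

StateDict = Dict[str, List[float]]


def is_expert_param(name: str) -> bool:
    """Assumed naming convention: expert weights contain '.experts.'."""
    return ".experts." in name


def aggregate_shared(
    client_states: Sequence[StateDict],
    num_samples: Sequence[int],
) -> StateDict:
    """Sample-weighted average over non-expert parameters only.

    Expert parameters are skipped entirely, so they are never uploaded
    or overwritten by the server in this sketch.
    """
    total = sum(num_samples)
    shared_names = [n for n in client_states[0] if not is_expert_param(n)]
    aggregated: StateDict = {}
    for name in shared_names:
        size = len(client_states[0][name])
        aggregated[name] = [
            sum(w * state[name][i] for state, w in zip(client_states, num_samples))
            / total
            for i in range(size)
        ]
    return aggregated


# Two toy clients with one shared (attention) and one expert parameter each.
client_a = {
    "layers.0.attn.weight": [1.0, 2.0],
    "layers.0.mlp.experts.0.weight": [9.0, 9.0],
}
client_b = {
    "layers.0.attn.weight": [3.0, 4.0],
    "layers.0.mlp.experts.0.weight": [7.0, 7.0],
}

global_update = aggregate_shared([client_a, client_b], num_samples=[1, 3])
print(global_update)
```

Only the attention weight appears in `global_update`. The expert weight stays on each client, which is where the communication saving in this sketch comes from.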

Paper Structure

This paper contains 54 sections, 12 equations, 6 figures, 9 tables, 2 algorithms.

Figures (6)

  • Figure 1: Challenges of applying traditional federated learning to MoE-based LLMs. As shown in the right panel, the router in an MoE layer sparsely routes tokens from different clients to a selected subset of experts. Traditional FL nevertheless aggregates all expert parameters densely, which conflicts with MoE's sparse activation mechanism, leading to prohibitive communication overhead and undermining the specialized knowledge of the experts.
  • Figure 2: Overview of FLEx framework. The FLEx framework begins by pruning personalized experts for each client using local data. The next step involves injecting personalized knowledge into the MoE layer via a gating mechanism, striking a balance between global knowledge sharing and local adaptation.
  • Figure 3: Evaluation of Helpfulness and Harmlessness on the Vicuna benchmark. Left: Score distributions for the base model, local training, and FLEx. Larger markers indicate average scores. Right: Average scores for all evaluated methods. FLEx achieves the highest scores on both metrics, with a notable improvement in harmlessness.
  • Figure 4: Our method significantly reduces communication overhead while simultaneously improving performance, in stark contrast to traditional federated learning approaches, which treat MoE models as dense and therefore incur prohibitive communication costs.
  • Figure 5: Activation counts of experts in the Qwen1.5-MoE-A2.7B model evaluated on the C4 dataset.
  • ...and 1 more figure