Federated Fine-Tuning of Sparsely-Activated Large Language Models on Resource-Constrained Devices
Fahao Chen, Jie Wan, Peng Li, Zhou Su, Dongxiao Yu
TL;DR
FLUX targets federated fine-tuning of sparsely activated MoE-based LLMs on resource-constrained devices with three techniques: quantization-based local profiling of expert activation, adaptive layer-aware merging of non-tuning experts, and dynamic expert role assignment guided by a gradient-based utility under an exploration-exploitation strategy. The system uses a parameter-server federated setup, relies on stale profiling to hide profiling overhead, and allocates the merging budget adaptively across layers to limit error propagation. On LLaMA-MoE and DeepSeek-MoE, FLUX achieves up to 4.75× speedup in time-to-accuracy while preserving near-full MoE performance across several benchmarks, significantly outperforming offloading, quantization-based, and simple expert selection baselines. The results indicate that scalable, privacy-preserving fine-tuning of large MoE LLMs on heterogeneous devices is practical.
Abstract
Federated fine-tuning of Mixture-of-Experts (MoE)-based large language models (LLMs) is challenging due to their massive computational requirements and the resource constraints of participants. Existing work attempts to fill this gap through model quantization, computation offloading, or expert pruning. However, these approaches fall short of the desired performance because they rely on impractical system assumptions and do not account for MoE-specific characteristics. In this paper, we propose FLUX, a system designed to enable federated fine-tuning of MoE-based LLMs across participants with constrained computing resources (e.g., consumer-grade GPUs), aiming to minimize time-to-accuracy. FLUX introduces three key innovations: (1) quantization-based local profiling to estimate expert activation with minimal overhead, (2) adaptive layer-aware expert merging to reduce resource consumption while preserving accuracy, and (3) dynamic expert role assignment using an exploration-exploitation strategy to balance tuning and non-tuning experts. Extensive experiments on LLaMA-MoE and DeepSeek-MoE with multiple benchmark datasets demonstrate that FLUX significantly outperforms existing methods, achieving up to 4.75× speedup in time-to-accuracy.
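To make innovation (3) more concrete, the following is a minimal sketch of an exploration-exploitation split between tuning and non-tuning experts. It is not FLUX's actual algorithm, which the abstract does not specify: the epsilon-greedy rule, the function name `assign_expert_roles`, the `epsilon` parameter, and the form of the utility scores are all assumptions made for illustration.

```python
import random


def assign_expert_roles(utilities, num_tuning, epsilon=0.1, rng=None):
    """Split experts into tuning and non-tuning sets (illustrative only).

    utilities:  dict mapping expert id -> utility score. The abstract mentions
                a utility signal but not its form; any non-negative score
                (e.g. a gradient norm) is assumed here.
    num_tuning: how many experts the device can afford to tune (assumed budget).
    epsilon:    share of the budget reserved for exploration (hypothetical).
    """
    rng = rng or random.Random()
    num_tuning = min(num_tuning, len(utilities))
    num_explore = int(round(epsilon * num_tuning))
    num_exploit = num_tuning - num_explore

    # Exploitation: keep the experts with the highest utility.
    ranked = sorted(utilities, key=utilities.get, reverse=True)
    exploit = ranked[:num_exploit]

    # Exploration: sample uniformly from the experts not yet chosen.
    remaining = ranked[num_exploit:]
    explore = rng.sample(remaining, min(num_explore, len(remaining)))

    tuning = set(exploit) | set(explore)
    non_tuning = set(utilities) - tuning
    return tuning, non_tuning


# Example: four experts in one layer, budget of two tuning experts.
scores = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.7}
tuning, non_tuning = assign_expert_roles(scores, num_tuning=2, epsilon=0.5,
                                         rng=random.Random(0))
```

In this sketch the non-tuning set is what a merging step such as innovation (2) would then compress; how FLUX couples the two steps is described in the paper body, not here.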
