WDMoE: Wireless Distributed Large Language Models with Mixture of Experts
Nan Xue, Yaping Sun, Zhiyong Chen, Meixia Tao, Xiaodong Xu, Liang Qian, Shuguang Cui, Ping Zhang
TL;DR
This work addresses the privacy and latency challenges of deploying large language models over wireless networks by proposing WDMoE, a wireless distributed LLM framework based on Mixture of Experts (MoE). The MoE layer is decomposed so that the gating network and the preceding neural network layer reside at the base station, while the expert networks are distributed across edge devices. This enables parallel processing and avoids a bottleneck at any single device. A training-free, latency-aware expert selection policy based on the Weight-to-Latency Ratio (WLR) balances model performance against end-to-end delay, with expert participation tuned through a threshold $\theta$ and dynamic selection of the top-$k$ experts. Experiments across multiple LLMs and benchmarks show that WDMoE can outperform large cloud-based models such as Llama 2-70B while significantly reducing end-to-end latency in wireless settings, which supports the practicality of cooperative edge-device LLMs. The approach offers a scalable path toward privacy-preserving, low-latency LLM inference in 6G-era networks, with potential impact on metaverse, autonomous systems, and intelligent networking scenarios.
Abstract
Large Language Models (LLMs) have achieved significant success in various natural language processing tasks, but how wireless communications can support LLMs has not been extensively studied. In this paper, we propose a wireless distributed LLM paradigm based on Mixture of Experts (MoE), named WDMoE, which deploys LLMs collaboratively across the edge servers at the base station (BS) and mobile devices in a wireless communication system. Specifically, we decompose the MoE layer in LLMs by deploying the gating network and the preceding neural network layer at the BS, while distributing the expert networks across the devices. This arrangement exploits the parallelism of expert networks running on distributed devices. Moreover, to cope with the instability of wireless communications, we design an expert selection policy that accounts for both model performance and end-to-end latency, where the latency comprises transmission delay and inference delay. Evaluations conducted across various LLMs and multiple datasets demonstrate that WDMoE not only outperforms existing models, such as Llama 2 with 70 billion parameters, but also significantly reduces end-to-end latency.
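To make the latency-aware expert selection concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that the WLR of an expert is its gating weight divided by its estimated end-to-end latency (transmission delay plus inference delay), that the threshold $\theta$ is applied to the WLR of the top-$k$ experts, and that the layer latency is set by the slowest selected device because experts run in parallel. The names `Expert` and `select_experts`, the fallback rule, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Expert:
    device_id: int
    gate_weight: float    # weight assigned by the gating network at the BS
    tx_delay_s: float     # estimated transmission delay (assumed known)
    infer_delay_s: float  # estimated on-device inference delay (assumed known)

    @property
    def latency_s(self) -> float:
        # End-to-end latency = transmission delay + inference delay.
        return self.tx_delay_s + self.infer_delay_s

    @property
    def wlr(self) -> float:
        # Assumed definition: gating weight per unit of latency.
        return self.gate_weight / self.latency_s


def select_experts(experts, k=2, theta=1.0):
    """Pick the top-k experts by gating weight, then drop those whose WLR < theta.

    Returns (renormalized weights per device, estimated layer latency in seconds).
    """
    top_k = sorted(experts, key=lambda e: e.gate_weight, reverse=True)[:k]
    kept = [e for e in top_k if e.wlr >= theta]
    if not kept:
        # Assumed fallback: always keep the highest-weight expert.
        kept = [top_k[0]]
    total = sum(e.gate_weight for e in kept)
    weights = {e.device_id: e.gate_weight / total for e in kept}
    # Experts run in parallel, so the slowest selected device sets the latency.
    layer_latency_s = max(e.latency_s for e in kept)
    return weights, layer_latency_s


if __name__ == "__main__":
    candidates = [
        Expert(device_id=0, gate_weight=0.6, tx_delay_s=0.02, infer_delay_s=0.03),
        Expert(device_id=1, gate_weight=0.3, tx_delay_s=0.20, infer_delay_s=0.10),
        Expert(device_id=2, gate_weight=0.1, tx_delay_s=0.01, infer_delay_s=0.01),
    ]
    print(select_experts(candidates, k=2, theta=2.0))
```

In this sketch, raising `theta` excludes slow devices at the cost of discarding some gating weight, while lowering it approaches plain top-$k$ routing. No retraining is involved, since only the routing decision changes.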
