WDMoE: Wireless Distributed Large Language Models with Mixture of Experts

Nan Xue; Yaping Sun; Zhiyong Chen; Meixia Tao; Xiaodong Xu; Liang Qian; Shuguang Cui; Ping Zhang

TL;DR

This work addresses the privacy and latency challenges of deploying large language models over wireless networks by proposing WDMoE, a wireless distributed LLM framework based on Mixture of Experts (MoE). Each MoE layer is decomposed so that the gating network and the preceding neural network layer reside at the base station, while the expert networks are distributed across edge devices, enabling parallel processing and avoiding a bottleneck at any single device. A training-free, latency-aware expert selection policy based on the Weight-to-Latency Ratio (WLR) balances model performance against end-to-end delay: it adjusts expert participation through a threshold $\theta$ and dynamically selects the top-$k$ experts, with no retraining required. Experiments across multiple LLMs and benchmarks show that WDMoE can outperform large cloud-based models such as Llama 2-70B while significantly reducing end-to-end latency in wireless settings, demonstrating the practicality of cooperative edge-device LLMs for real-world wireless applications. The approach offers a scalable path toward privacy-preserving, low-latency LLM inference in 6G-era networks, with potential impact on metaverse, autonomous systems, and intelligent networking scenarios.
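A minimal Python sketch of how a WLR-based selection could work for a single token. The exact rule is not given in this summary, so the following are assumptions made for illustration: WLR is taken as the gating weight divided by the estimated latency, the threshold $\theta$ is applied relative to the best WLR among the top-$k$ experts, and the surviving weights are renormalized to sum to one. The names `Expert` and `select_experts` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Expert:
    device_id: int
    gate_weight: float  # gating-network weight for this token
    latency_s: float    # estimated transmission + inference delay, must be > 0


def select_experts(experts, k=2, theta=0.5):
    """Return {device_id: renormalized weight} for the experts that are kept.

    Assumed rule (not taken from the paper): keep a top-k expert only if its
    weight-to-latency ratio is at least theta times the best ratio.
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta is assumed to lie in [0, 1]")
    # Step 1: usual top-k choice by gating weight.
    top_k = sorted(experts, key=lambda e: e.gate_weight, reverse=True)[:k]
    # Step 2: weight-to-latency ratio of each candidate.
    wlr = {e.device_id: e.gate_weight / e.latency_s for e in top_k}
    best = max(wlr.values())
    # Step 3: drop candidates that are slow relative to their weight.
    kept = [e for e in top_k if wlr[e.device_id] >= theta * best]
    # Step 4: renormalize the remaining gating weights.
    total = sum(e.gate_weight for e in kept)
    return {e.device_id: e.gate_weight / total for e in kept}


candidates = [
    Expert(device_id=0, gate_weight=0.5, latency_s=0.10),
    Expert(device_id=1, gate_weight=0.3, latency_s=0.40),
    Expert(device_id=2, gate_weight=0.2, latency_s=0.05),
]
strict = select_experts(candidates, k=2, theta=0.5)   # device 1 is too slow
lenient = select_experts(candidates, k=2, theta=0.1)  # both top-2 devices kept
```

With the stricter threshold, the slow device 1 is dropped and device 0 carries the full weight; with the lenient threshold, both top-2 experts participate. No parameter is updated, which is what makes this kind of policy training-free.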

Abstract

Large Language Models (LLMs) have achieved significant success in various natural language processing tasks, but how wireless communications can support LLMs has not been extensively studied. In this paper, we propose a wireless distributed LLM paradigm based on Mixture of Experts (MoE), named WDMoE, which deploys LLMs collaboratively across the edge servers of the base station (BS) and mobile devices in a wireless communication system. Specifically, we decompose the MoE layer in LLMs by deploying the gating network and the preceding neural network layer at the BS, while distributing the expert networks across the devices. This arrangement leverages the parallel capabilities of expert networks on distributed devices. Moreover, to overcome the instability of wireless communications, we design an expert selection policy that takes into account both the performance of the model and the end-to-end latency, which includes both transmission delay and inference delay. Evaluations conducted across various LLMs and multiple datasets demonstrate that WDMoE not only outperforms existing models, such as Llama 2 with 70 billion parameters, but also significantly reduces end-to-end latency.
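The latency notion used in the abstract can be illustrated with a short Python sketch. It assumes (the abstract does not state this explicitly) that the selected experts run in parallel, so an MoE layer finishes only when the slowest selected device has returned its result, and that each device's delay is the sum of its transmission delay and its inference delay. The function name `layer_latency` and the dictionary layout are hypothetical.

```python
def layer_latency(selected, transmission_s, inference_s):
    """Latency of one MoE layer under the parallel-experts assumption.

    selected:       iterable of device ids chosen for this token
    transmission_s: {device_id: transmission delay in seconds}
    inference_s:    {device_id: expert inference delay in seconds}
    """
    selected = list(selected)
    if not selected:
        raise ValueError("at least one expert must be selected")
    # Each device contributes transmission + inference delay; the layer
    # waits for the slowest device because outputs are combined at the BS.
    return max(transmission_s[d] + inference_s[d] for d in selected)


transmission = {0: 0.02, 1: 0.04, 2: 0.01}
inference = {0: 0.05, 1: 0.10, 2: 0.03}
both = layer_latency([0, 1], transmission, inference)   # bounded by device 1
fast_only = layer_latency([0], transmission, inference)  # device 1 excluded
```

Under this assumption, excluding a single slow device lowers the latency of the whole layer, which is why an expert selection policy that is aware of channel conditions can cut end-to-end delay.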

Paper Structure (12 sections, 10 equations, 3 figures, 1 table, 1 algorithm)

Introduction
The WDMoE Framework
Gating Network of MoE-based LLMs
Distributed Deployment of WDMoE
Communication Model
Computing Model
End to End Inference Latency
Expert Selection Policy
Experiment Results
Experiment Settings
Performance Evaluation
Conclusion

Figures (3)

Figure 1: (a) MoE-based LLM architecture; (b) The proposed WDMoE system model.
Figure 2: The data stream and expert selection in WDMoE. When a user sends a prompt to an MoE-based LLM, it is preprocessed, embedded, and subjected to attention operations either locally or at the BS, depending on user preference. Each token's embedding is then analyzed by a gating network to assign weights to each expert. At the BS, WDMoE dynamically adjusts these weights and the number of experts based on the gating network output and the user channel conditions. The token embeddings are sent to the respective user devices for processing by expert networks. Once processed, the results are sent back to the BS, where they are weighted, summed, and then transformed from embeddings into words.
Figure 3: Performance and latency of WDMoE on various benchmarks.
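The combination step described in the Figure 2 caption (expert results returned to the BS, then weighted and summed) can be sketched in a few lines of Python. This is an illustration only: the function name `combine_expert_outputs`, the dictionary layout, and the toy two-dimensional vectors are assumptions, and real expert outputs would be high-dimensional tensors.

```python
def combine_expert_outputs(outputs, weights):
    """Weighted sum of expert outputs for one token, as performed at the BS.

    outputs: {device_id: list of floats returned by that device's expert}
    weights: {device_id: adjusted gating weight for that expert}
    """
    if not outputs:
        raise ValueError("no expert outputs received")
    dim = len(next(iter(outputs.values())))
    combined = [0.0] * dim
    for device_id, vec in outputs.items():
        if len(vec) != dim:
            raise ValueError("expert outputs must share one dimension")
        w = weights[device_id]
        for i, value in enumerate(vec):
            combined[i] += w * value
    return combined


# Toy example: two devices return 2-dimensional expert outputs.
expert_outputs = {0: [1.0, 2.0], 1: [3.0, 4.0]}
gate_weights = {0: 0.25, 1: 0.75}
token_output = combine_expert_outputs(expert_outputs, gate_weights)
```

Only devices that were actually selected appear in `outputs`, so an expert dropped by the selection policy contributes nothing to the sum.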
