WDMoE: Wireless Distributed Mixture of Experts for Large Language Models

Nan Xue, Yaping Sun, Zhiyong Chen, Meixia Tao, Xiaodong Xu, Liang Qian, Shuguang Cui, Wenjun Zhang, Ping Zhang

TL;DR

This paper proposes a wireless distributed Mixture of Experts (WDMoE) architecture that enables collaborative deployment of LLMs across edge servers at the base station and mobile devices in wireless networks, and it develops a performance metric for WDMoE-based LLMs.

Abstract

Large Language Models (LLMs) have achieved significant success in various natural language processing tasks, but the role of wireless networks in supporting LLMs has not been thoroughly explored. In this paper, we propose a wireless distributed Mixture of Experts (WDMoE) architecture to enable collaborative deployment of LLMs across edge servers at the base station (BS) and mobile devices in wireless networks. Specifically, we decompose the MoE layer in LLMs by placing the gating network and the preceding neural network layer at BS, while distributing the expert networks among the devices. This deployment leverages the parallel inference capabilities of expert networks on mobile devices, effectively utilizing the limited computing and caching resources of these devices. Accordingly, we develop a performance metric for WDMoE-based LLMs, which accounts for both model capability and latency. To minimize the latency while maintaining accuracy, we jointly optimize expert selection and bandwidth allocation based on the performance metric. Moreover, we build a hardware testbed using NVIDIA Jetson kits to validate the effectiveness of WDMoE. Both theoretical simulations and practical hardware experiments demonstrate that the proposed method can significantly reduce the latency without compromising LLM performance.
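The split described in the abstract (gating at the BS, expert networks on the devices, results combined back at the BS) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the top-k renormalization rule, and the toy experts are assumptions made for illustration, and the wireless transfer and bandwidth allocation are omitted entirely.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the raw gate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(scores: Sequence[float], k: int) -> Dict[int, float]:
    """BS side: keep the k highest-weight experts and renormalize.

    Renormalizing over the selected experts is an assumption here,
    borrowed from common top-k MoE practice.
    """
    weights = softmax(scores)
    chosen = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    norm = sum(weights[i] for i in chosen)
    return {i: weights[i] / norm for i in chosen}


def moe_forward(token: Vector, gate_scores: Sequence[float],
                experts: Sequence[Expert], k: int = 2) -> Vector:
    """Route one token embedding to k experts and combine their outputs."""
    selected = top_k_gate(gate_scores, k)
    # Device side: in a real deployment each call would run on a different
    # mobile device in parallel; here they run sequentially in one process.
    outputs = {i: experts[i](token) for i in selected}
    # BS side: weighted sum of the returned expert outputs.
    combined = [0.0] * len(token)
    for i, weight in selected.items():
        for d, value in enumerate(outputs[i]):
            combined[d] += weight * value
    return combined


# Hypothetical toy experts: each one scales the embedding by a fixed factor.
toy_experts: List[Expert] = [
    (lambda t, s=s: [s * x for x in t]) for s in (1.0, 2.0, 3.0, 4.0)
]
```

Calling `moe_forward([2.0, 4.0], [0.0, 5.0, 1.0, 4.0], toy_experts, k=2)` routes the token to experts 1 and 3 only, so just two devices would need to receive the embedding and return a result.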

Paper Structure

This paper contains 25 sections, 24 equations, 10 figures, 4 tables, 2 algorithms.

Figures (10)

  • Figure 1: (a) MoE-based LLM architecture (Lepikhin et al., 2020); (b) The proposed WDMoE-based LLM system model.
  • Figure 2: Expert network structure.
  • Figure 3: An example of waiting in the attention mechanism.
  • Figure 4: The data stream, expert selection, and bandwidth allocation in WDMoE. When a user sends a prompt to WDMoE, it is preprocessed, embedded, and subjected to attention operations either locally or at the BS, depending on user preference. The gating network analyzes each token's embedding and assigns a weight to each expert. WDMoE dynamically adjusts these weights and the number of selected experts, and optimizes the bandwidth allocation, based on the gating network output and the wireless channel conditions. The token embeddings are sent to the respective devices for processing by their expert networks. Once processed, the results are sent back to the BS, where they are weighted, summed, and then transformed from embeddings into words.
  • Figure 5: Latency per batch of data versus total bandwidth on the ARC-C dataset.
  • ...and 5 more figures