Communication-Efficient Distributed On-Device LLM Inference Over Wireless Networks

Kai Zhang, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief

TL;DR

This work tackles the challenge of running large language models on resource-constrained edge devices by proposing a distributed on-device inference framework based on tensor parallelism and over-the-air (AirComp) all-reduce to dramatically cut communication latency. A two-stage optimization is developed to cope with mixed timescales: short-term transceiver design via SDR and long-term model assignment via stochastic SCA, with convergence guarantees to a stationary point. The approach is extended to multi-antenna edge devices to exploit spatial multiplexing, and extensive simulations on LLaMA2/3 models demonstrate up to 5x speedups and improved inference accuracy compared to baselines, including centralized inference. The results highlight practical potential for latency-sensitive edge deployments, paving the way for scalable, privacy-preserving distributed LLM inference in wireless networks, and the authors provide an open-source implementation for reproducibility.

Abstract

Large language models (LLMs) have demonstrated remarkable success across various application domains, but their enormous sizes and computational demands pose significant challenges for deployment on resource-constrained edge devices. To address this issue, we propose a novel distributed on-device LLM inference framework that leverages tensor parallelism to partition the neural network tensors (e.g., weight matrices) of one LLM across multiple edge devices for collaborative inference. A key challenge in tensor parallelism is the frequent all-reduce operations for aggregating intermediate layer outputs across participating devices, which incur significant communication overhead. To alleviate this bottleneck, we propose an over-the-air computation (AirComp) approach that harnesses the analog superposition property of wireless multiple-access channels to perform fast all-reduce steps. To utilize the heterogeneous computational capabilities of edge devices and mitigate communication distortions, we investigate a joint model assignment and transceiver optimization problem to minimize the average transmission error. The resulting mixed-timescale stochastic non-convex optimization problem is intractable, and we propose an efficient two-stage algorithm to solve it. Moreover, we prove that the proposed algorithm converges almost surely to a stationary point of the original problem. Comprehensive simulation results show that the proposed framework outperforms existing benchmark schemes, achieving up to 5x inference speed acceleration and improving inference accuracy.
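
To make the tensor-parallel idea in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a two-layer MLP with a ReLU activation and the common column/row split of the two weight matrices; the paper's exact partitioning may differ. The function names and the `shares` parameter (how many hidden units each device holds, standing in for an uneven model assignment) are hypothetical. The all-reduce is modeled as an exact sum, whereas AirComp would compute it over the air with some distortion.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in m]

def mlp(x, w1, w2):
    """Unpartitioned reference: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)

def tensor_parallel_mlp(x, w1, w2, shares):
    """Split the hidden units across devices; shares[n] is the count on device n."""
    assert sum(shares) == len(w2) and all(s > 0 for s in shares)
    total, start = None, 0
    for size in shares:
        cols = slice(start, start + size)
        w1_n = [row[cols] for row in w1]  # column slice of the first weight matrix
        w2_n = w2[cols]                   # matching row slice of the second matrix
        partial = mlp(x, w1_n, w2_n)      # computed locally on device n
        # All-reduce step: sum the partial outputs of all devices (exact here).
        total = partial if total is None else [
            [t + p for t, p in zip(t_row, p_row)]
            for t_row, p_row in zip(total, partial)
        ]
        start += size
    return total

rng = random.Random(0)
d_model, d_hidden = 4, 8
x = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(2)]
w1 = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
w2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]

reference = mlp(x, w1, w2)
distributed = tensor_parallel_mlp(x, w1, w2, shares=[5, 3])
max_gap = max(
    abs(r - d)
    for r_row, d_row in zip(reference, distributed)
    for r, d in zip(r_row, d_row)
)
print(f"max |reference - distributed| = {max_gap:.2e}")
```

Because ReLU acts element-wise on the hidden units, each device can apply it to its own slice without communication, so only one summation per MLP block is needed. That summation is the all-reduce step the abstract identifies as the communication bottleneck.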

Paper Structure

This paper contains 34 sections, 5 theorems, 58 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Lemma 1

For a given aggregation beamformer $\mathbf{a}$, the transmission MSE is minimized by using the zero-forcing precoders $b_n^* = \frac{1}{\mathbf{a}^{\mathsf{H}}\mathbf{h}_n}, \forall n$.
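
A small numerical check can illustrate why the zero-forcing choice is natural. The sketch below is illustrative only and rests on assumptions not stated on this page: a standard AirComp signal model in which the server estimates the sum of the devices' symbols as $\mathbf{a}^{\mathsf{H}}(\sum_n \mathbf{h}_n b_n s_n + \mathbf{n})$, independent unit-variance symbols, noise variance `noise_var` per antenna, and no transmit power constraint. Under these assumptions the MSE splits into a misalignment term $\sum_n |\mathbf{a}^{\mathsf{H}}\mathbf{h}_n b_n - 1|^2$ and a noise term $\sigma^2\|\mathbf{a}\|^2$. All function names are hypothetical.

```python
import random

def inner(a, h):
    """Hermitian inner product a^H h for complex vectors stored as lists."""
    return sum(x.conjugate() * y for x, y in zip(a, h))

def transmission_mse(a, channels, precoders, noise_var):
    """MSE under the assumed model: misalignment term plus noise term."""
    misalignment = sum(
        abs(inner(a, h) * b - 1) ** 2 for h, b in zip(channels, precoders)
    )
    noise = noise_var * sum(abs(x) ** 2 for x in a)
    return misalignment + noise

def zero_forcing_precoders(a, channels):
    """Lemma 1: b_n = 1 / (a^H h_n) for every device n."""
    return [1 / inner(a, h) for h in channels]

rng = random.Random(1)

def cgauss():
    """Draw one complex Gaussian sample (illustrative channel statistics)."""
    return complex(rng.gauss(0, 1), rng.gauss(0, 1))

num_devices, num_antennas, noise_var = 4, 8, 0.1
a = [cgauss() for _ in range(num_antennas)]
channels = [[cgauss() for _ in range(num_antennas)] for _ in range(num_devices)]

b_zf = zero_forcing_precoders(a, channels)
mse_zf = transmission_mse(a, channels, b_zf, noise_var)
noise_floor = noise_var * sum(abs(x) ** 2 for x in a)

# Perturbing the precoders reintroduces a strictly positive misalignment term.
b_perturbed = [b * (1 + 0.05 * cgauss()) for b in b_zf]
mse_perturbed = transmission_mse(a, channels, b_perturbed, noise_var)

print(f"zero-forcing MSE: {mse_zf:.6f}")
print(f"noise floor:      {noise_floor:.6f}")
print(f"perturbed MSE:    {mse_perturbed:.6f}")
```

With zero-forcing precoders the misalignment term vanishes, leaving only the noise term, which depends on $\mathbf{a}$ alone. This sketch omits any power constraint, so it does not reproduce the paper's full transceiver design.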

Figures (6)

  • Figure 1: An illustration of the distributed on-device LLM inference system, showing the system workflow and visualizing tensor parallelism for (a) MLP and (b) self-attention layers.
  • Figure 2: Illustration of MLP matrix multiplication for conventional unpartitioned approach and tensor parallelism with two devices.
  • Figure 3: Block diagram of Algorithm 1
  • Figure 4: Convergence of Algorithm 1 for the scenarios of single-antenna devices (Top) and multi-antenna devices (Bottom).
  • Figure 5: The average MSE (a), perplexity (b), and average generation time (c) versus the number of edge devices for the scenario of single-antenna devices.
  • ...and 1 more figures

Theorems & Definitions (12)

  • Lemma 1
  • proof
  • Lemma 2
  • Remark 1
  • Remark 2
  • Lemma 3
  • proof
  • Definition 1
  • Theorem 1
  • proof
  • ...and 2 more