Communication-Efficient Distributed On-Device LLM Inference Over Wireless Networks

Kai Zhang; Hengtao He; Shenghui Song; Jun Zhang; Khaled B. Letaief

TL;DR

This work addresses running large language models (LLMs) on resource-constrained edge devices by proposing a distributed on-device inference framework that combines tensor parallelism with an over-the-air computation (AirComp) all-reduce to cut communication latency. A two-stage algorithm handles the mixed timescales: short-term transceiver design via semidefinite relaxation (SDR) and long-term model assignment via stochastic successive convex approximation (SCA), with guaranteed convergence to a stationary point. The approach is extended to multi-antenna edge devices to exploit spatial multiplexing, and simulations on LLaMA2/3 models show up to 5x speedups and improved inference accuracy compared to baselines, including centralized inference. The results point to practical potential for latency-sensitive edge deployments and for scalable, privacy-preserving distributed LLM inference in wireless networks, and the authors provide an open-source implementation for reproducibility.

Abstract

Large language models (LLMs) have demonstrated remarkable success across various application domains, but their enormous size and computational demands pose significant challenges for deployment on resource-constrained edge devices. To address this issue, we propose a novel distributed on-device LLM inference framework that leverages tensor parallelism to partition the neural network tensors (e.g., weight matrices) of one LLM across multiple edge devices for collaborative inference. A key challenge in tensor parallelism is the frequent all-reduce operations for aggregating intermediate layer outputs across participating devices, which incur significant communication overhead. To alleviate this bottleneck, we propose an over-the-air computation (AirComp) approach that harnesses the analog superposition property of wireless multiple-access channels to perform fast all-reduce steps. To utilize the heterogeneous computational capabilities of edge devices and mitigate communication distortions, we investigate a joint model assignment and transceiver optimization problem to minimize the average transmission error. The resulting mixed-timescale stochastic non-convex optimization problem is intractable, and we propose an efficient two-stage algorithm to solve it. Moreover, we prove that the proposed algorithm converges almost surely to a stationary point of the original problem. Comprehensive simulation results show that the proposed framework outperforms existing benchmark schemes, achieving up to 5x inference speed acceleration and improving inference accuracy.
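
To make the tensor-parallel all-reduce concrete, the following is a minimal sketch and not the authors' implementation. It splits one linear layer column-wise across three hypothetical devices with uneven shares, then sums the partial outputs through an idealized over-the-air channel. The function names, the shard sizes, and the channel model (perfectly aligned superposition plus additive Gaussian noise) are all illustrative assumptions; fading, power constraints, and the transceiver design studied in the paper are ignored.

```python
import random

def matvec(W, x):
    """Plain matrix-vector product on lists of lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def split_columns(W, x, shares):
    """Give each device a contiguous block of columns of W and the matching
    entries of x. `shares` is a hypothetical model assignment (columns per device)."""
    shards, start = [], 0
    for n in shares:
        cols = slice(start, start + n)
        shards.append(([row[cols] for row in W], x[cols]))
        start += n
    return shards

def aircomp_all_reduce(partials, noise_std, rng):
    """Idealized over-the-air sum (assumption): simultaneously transmitted signals
    superpose, so the receiver observes their sum plus Gaussian noise."""
    dim = len(partials[0])
    return [sum(p[i] for p in partials) + rng.gauss(0.0, noise_std)
            for i in range(dim)]

rng = random.Random(0)
rows, cols = 4, 8
W = [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
x = [rng.uniform(-1, 1) for _ in range(cols)]

# Uneven split, standing in for devices with heterogeneous compute capability.
shares = [4, 3, 1]
partials = [matvec(Wk, xk) for Wk, xk in split_columns(W, x, shares)]

exact = matvec(W, x)                                # centralized reference
noiseless = aircomp_all_reduce(partials, 0.0, rng)  # equals the reference
noisy = aircomp_all_reduce(partials, 0.05, rng)     # distorted by channel noise

# Mean squared error between the noisy aggregate and the reference output.
mse = sum((a - b) ** 2 for a, b in zip(noisy, exact)) / rows
print("transmission MSE:", mse)
```

In this toy setting the sum of the per-device partial products reproduces the centralized output, and the only error comes from the noise added during aggregation. That error is a simplified stand-in for the transmission error that the joint model assignment and transceiver optimization aims to minimize.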
