EdgeShard: Efficient LLM Inference via Collaborative Edge Computing

Mingjin Zhang; Jiannong Cao; Xiaoming Shen; Zeyang Cui

TL;DR

EdgeShard runs large language models without relying solely on centralized cloud resources by enabling collaborative inference across heterogeneous edge devices and cloud servers. It formulates a joint device selection and model partition problem and solves it with dynamic programming, optimizing latency and throughput separately, supported by offline profiling and online scheduling. The approach is validated on a physical testbed with Llama2 models, showing up to 50% latency reduction and roughly 2x throughput improvement over baseline methods, especially under memory and bandwidth constraints. The work shows a privacy-preserving way to deploy and accelerate LLMs at the network edge while using cloud compute when bandwidth permits.

Abstract

Large language models (LLMs) have shown great potential in natural language processing and content generation. However, current LLMs rely heavily on cloud computing, leading to prolonged latency, high bandwidth cost, and privacy concerns. Edge computing promises to address such concerns by deploying LLMs on edge devices, closer to data sources. Some works use model quantization to shrink the model so that it fits resource-constrained edge devices, but this leads to accuracy loss. Other works use cloud-edge collaboration, which suffers from unstable network connections. In this work, we leverage collaborative edge computing to facilitate collaboration among edge devices and cloud servers for jointly performing efficient LLM inference. We propose a general framework to partition the LLM into shards and deploy them on distributed devices. To achieve efficient LLM inference, we formulate an adaptive joint device selection and model partition problem and design an efficient dynamic programming algorithm to optimize the inference latency and throughput, respectively. Experiments with Llama2 series models on a heterogeneous physical prototype demonstrate that EdgeShard achieves up to 50% latency reduction and 2x throughput improvement over baseline methods.
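To make the joint device selection and model partition idea concrete, the following is a minimal sketch of a latency-oriented dynamic program. It is not the paper's algorithm. It assumes, for illustration only, that each device hosts at most one contiguous shard of layers, that the first layer stays on the source device, that activations have the same transfer cost at every layer boundary, and that the result is returned to the source. The function name `plan_latency` and the inputs `comp`, `comm`, `mem`, and `cap` are hypothetical and stand in for values that offline profiling would supply.

```python
from functools import lru_cache
from math import inf


def plan_latency(comp, comm, mem, cap, source=0):
    """Return (latency, plan) for a sequential pass over all layers.

    comp[l][d]: profiled time of layer l on device d (assumed input)
    comm[a][b]: time to send activations from device a to device b
    mem[l]:     memory needed by layer l
    cap[d]:     memory budget of device d
    plan:       tuple of (device, first_layer, last_layer) shards
    """
    n_layers, n_dev = len(comp), len(cap)

    @lru_cache(maxsize=None)
    def best(start, dev, used):
        # Cheapest way to run layers start..n_layers-1 when the next shard
        # begins at `start` on `dev`; `used` is a bitmask of taken devices.
        result = (inf, ())
        t, m = 0.0, 0.0
        for end in range(start, n_layers):
            t += comp[end][dev]
            m += mem[end]
            if m > cap[dev]:
                break  # shard no longer fits on this device
            if end == n_layers - 1:
                # Last shard: send the output back to the source device.
                cand = (t + comm[dev][source], ((dev, start, end),))
                result = min(result, cand)
                continue
            for nxt in range(n_dev):
                if (used >> nxt) & 1:
                    continue  # each device hosts at most one shard
                rest_t, rest_plan = best(end + 1, nxt, used | (1 << nxt))
                cand = (t + comm[dev][nxt] + rest_t,
                        ((dev, start, end),) + rest_plan)
                result = min(result, cand)
        return result

    return best(0, source, 1 << source)


# Toy instance: a slow, memory-limited source (device 0) and a faster peer.
comp = [[4, 1]] * 4            # 4 layers; device 0 takes 4, device 1 takes 1
comm = [[0, 2], [2, 0]]        # symmetric transfer time between the two
mem = [1, 1, 1, 1]
cap = [2, 4]
latency, plan = plan_latency(comp, comm, mem, cap)
print(latency, plan)
```

In this toy instance the source cannot hold the whole model, so the sketch keeps only the first layer there and offloads the rest. The bitmask state grows exponentially with the number of devices, so this form is only practical for small device pools.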

Paper Structure (18 sections, 13 equations, 10 figures, 4 tables, 2 algorithms)

Introduction
Preliminaries and Motivations
Collaborative Edge Computing for LLMs
Optimize LLM Inference
Optimize LLM inference latency
Optimize LLM inference throughput
Experimental Evaluation
Experimental Setup
Overall Evaluation
Effects of Bandwidth
Effects of Source Node
Effects of Pipeline Execution strategy
Related Work
Edge Computing for Efficient LLM
LLM for Optimizing Edge Computing
...and 3 more sections

Figures (10)

Figure 1: Collaborative edge computing integrates the computing resources of ubiquitous geo-distributed devices for jointly performing computational tasks, with great benefits of enlarged resource pool, low-latency data processing, flexible device access, and expanded service region.
Figure 2: LLM inference has an autoregressive nature.
Figure 3: Framework of EdgeLLM. It consists of three stages: offline profiling, task scheduling optimization, and online collaborative LLM inference.
Figure 4: Collaborative LLM inference
Figure 5: Different pipeline execution strategies of EdgeShard. EdgeShard-No-bubbles reduces device idle time to improve throughput by allowing immediate token generation of a micro-batch without waiting for other micro-batches.
...and 5 more figures
