TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

Zonghang Li; Wenjiao Feng; Mohsen Guizani; Hongfang Yu

TL;DR

This paper tackles privacy-preserving LLM inference on edge devices, arguing that tensor parallelism can outperform pipeline/model parallelism in single-user edge scenarios. It introduces TPI-LLM, a compute- and memory-efficient tensor-parallel framework that combines a star-based allreduce with a sliding window memory scheduler to run 70B-scale models on low-resource devices while keeping prompts and generations on-device. On emulated and real testbeds, TPI-LLM cuts time-to-first-token and token latency by over 80% relative to Accelerate and by over 90% relative to Transformers and Galaxy, and brings the peak per-device memory footprint for Llama 2-70B down to 3.1 GB, thanks to the memory scheduler and KVCache partitioning. The approach strengthens on-device privacy, reduces cloud dependency, and makes very large models practical on modest hardware through edge collaboration.

Abstract

Large model inference is shifting from cloud to edge due to concerns about the privacy of user interaction data. However, edge devices often struggle with limited computing power, memory, and bandwidth, requiring collaboration across multiple devices to run and speed up LLM inference. Pipeline parallelism, the mainstream solution, is inefficient for single-user scenarios, while tensor parallelism struggles with frequent communications. In this paper, we argue that tensor parallelism can be more effective than pipeline parallelism on low-resource devices, and present a compute- and memory-efficient tensor parallel inference system, named TPI-LLM, to serve 70B-scale models. TPI-LLM keeps sensitive raw data local on the users' devices and introduces a sliding window memory scheduler to dynamically manage layer weights during inference, with disk I/O latency overlapped with computation and communication. This allows larger models to run smoothly on memory-limited devices. We analyze the communication bottleneck and find that link latency, not bandwidth, emerges as the main issue, so a star-based allreduce algorithm is implemented. Through extensive experiments on both emulated and real testbeds, TPI-LLM demonstrated over 80% less time-to-first-token and token latency compared to Accelerate, and over 90% compared to Transformers and Galaxy, while cutting the peak memory footprint of Llama 2-70B by 90%, requiring only 3.1 GB of memory for 70B-scale models.
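The star-based allreduce mentioned in the abstract can be pictured with a small single-process simulation. This is only a sketch under stated assumptions: one device acts as a hub that gathers every device's partial vector, sums them, and sends the sum back. The function name `star_allreduce` and the list-based "devices" are hypothetical illustrations, not the TPI-LLM API, and no real network transfer takes place.

```python
from typing import List


def star_allreduce(partials: List[List[float]], hub: int = 0) -> List[List[float]]:
    """Simulate a star allreduce over in-memory vectors.

    Assumption (not from the paper's code): device `hub` gathers all partial
    vectors, reduces them element-wise, and broadcasts the sum back.
    """
    if not partials:
        raise ValueError("need at least one device")
    length = len(partials[hub])
    if any(len(vec) != length for vec in partials):
        raise ValueError("all partial vectors must have the same length")

    # Hop 1 (gather + reduce): the hub accumulates every device's partial result.
    total = [0.0] * length
    for vec in partials:
        for i, value in enumerate(vec):
            total[i] += value

    # Hop 2 (broadcast): every device receives its own copy of the reduced vector.
    return [list(total) for _ in partials]


# Three simulated devices, each holding a partial result of the same shape.
result = star_allreduce([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

In this simulation every device ends up with the same summed vector after two logical hops (one to the hub, one back), regardless of how many devices participate.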

Paper Structure (24 sections, 6 theorems, 21 equations, 10 figures, 9 tables, 2 algorithms)

Introduction
Observations and Motivations
TPI-LLM Framework with Sliding Window Memory Scheduling
The Parallel Framework Design of TPI-LLM System
Allreduce latency analysis
Sliding Window Memory Scheduling
Experiments
Overview of TPI-LLM performance
Scaling over varying edge conditions
Comparison with benchmarks
Real case study
Conclusion
Appendix
Proof of proposition 2
A simple-to-use memory scheduler
...and 9 more sections

Key Result

Proposition 1

The bottleneck in allreduce is not network bandwidth, but link latency.
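A rough cost model helps show why link latency can dominate. The sketch below assumes allreduce time is `hops * tau + transferred_bytes / bandwidth`, with 2 hops for a star (gather, then broadcast) and the textbook 2(N-1) steps for a ring. The function name, the worst-case assumption that the hub's link carries all star traffic serially, and every number in the usage example are illustrative assumptions, not measurements or formulas taken from the paper.

```python
def allreduce_time(n_devices: int, payload_bytes: int, bandwidth_bps: float,
                   tau_s: float, topology: str) -> float:
    """Estimate allreduce time in seconds under a simple latency + volume model."""
    if n_devices < 2:
        return 0.0
    # Seconds needed to push one full payload over a single link.
    transfer = payload_bytes * 8 / bandwidth_bps
    if topology == "star":
        hops = 2  # workers -> hub, hub -> workers
        # Assumed worst case: the hub link carries (N-1) payloads in and out serially.
        volume = 2 * (n_devices - 1) * transfer
    elif topology == "ring":
        hops = 2 * (n_devices - 1)  # reduce-scatter steps + all-gather steps
        # Each step moves a 1/N chunk of the payload.
        volume = 2 * (n_devices - 1) / n_devices * transfer
    else:
        raise ValueError(f"unknown topology: {topology}")
    return hops * tau_s + volume


# Illustrative numbers only: 8 devices, a 32 KiB payload (e.g. 8192 float32 values),
# a 1 Gbps link, and 5 ms of per-hop latency.
payload = 8192 * 4
star = allreduce_time(8, payload, 1e9, 0.005, "star")
ring = allreduce_time(8, payload, 1e9, 0.005, "ring")

# Same setup with zero link latency, where only data volume matters.
star_no_latency = allreduce_time(8, payload, 1e9, 0.0, "star")
ring_no_latency = allreduce_time(8, payload, 1e9, 0.0, "ring")
```

Under these assumed numbers the star comes out ahead once per-hop latency is in the millisecond range, while the ring only wins when latency is negligible. That is the intuition behind treating link latency, rather than bandwidth, as the bottleneck for small per-token payloads.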

Figures (10)

Figure 1: Comparison of (a,b) tensor and model parallelism in terms of computation and communication time, and (c) per-device memory footprint as the number of tensor-parallel nodes increases.
Figure 2: Overview of the TPI-LLM parallel framework.
Figure 3: Impact of link latency $\tau$.
Figure 4: An illustration of the sliding window memory scheduling. Blue blocks indicate the blocks currently executed, with numbered blocks for attention or FFN computing and unnumbered blocks for allreduce communication. Green blocks indicate loaded model weights. The dashed box represents the sliding window, with size 4 in this case.
Figure 5: Token latency over varying number of devices, CPU cores, and network bandwidth on Llama 2-70B.
...and 5 more figures
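The sliding window in Figure 4 can be sketched as a loop that keeps at most a fixed number of weight blocks resident, loads ahead of the block being executed, and releases each block once it has run. This is a simplified, synchronous simulation under stated assumptions: the paper describes disk I/O overlapped with computation and communication, whereas this sketch loads inline for determinism. The names `run_with_sliding_window`, `load`, `compute`, and `release` are hypothetical and not part of the TPI-LLM codebase.

```python
from collections import deque
from typing import Callable, Deque


def run_with_sliding_window(n_blocks: int, window: int,
                            load: Callable[[int], None],
                            compute: Callable[[int], None],
                            release: Callable[[int], None]) -> int:
    """Run blocks in order with at most `window` blocks loaded; return the peak count."""
    if window < 1:
        raise ValueError("window must be at least 1")
    loaded: Deque[int] = deque()
    next_to_load = 0
    peak = 0
    for current in range(n_blocks):
        # Preload upcoming blocks until the window is full.
        # (Assumption: a real scheduler would do this in the background.)
        while next_to_load < n_blocks and len(loaded) < window:
            load(next_to_load)
            loaded.append(next_to_load)
            next_to_load += 1
        peak = max(peak, len(loaded))
        compute(current)
        # The oldest loaded block is the one just executed; free its weights.
        release(loaded.popleft())
    return peak


# Record the order of operations for six blocks with a window of size 4.
log = []
peak = run_with_sliding_window(
    n_blocks=6,
    window=4,
    load=lambda b: log.append(("load", b)),
    compute=lambda b: log.append(("compute", b)),
    release=lambda b: log.append(("release", b)),
)
```

With a window of 4, the number of resident blocks never exceeds 4 no matter how many blocks the model has, which is the property that bounds peak memory in this simplified picture.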

Theorems & Definitions (7)

Proposition 1
proof
Proposition 2
Proposition 3: Loose Steady Condition
Proposition 4: Tight Steady Condition
Proposition 5: Peak Memory Footprint
Proposition 6: Loose Steady Condition with Block Retention
