Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization

Xinyuan Zhang, Jiang Liu, Zehui Xiong, Yudong Huang, Gaochang Xie, Ran Zhang

TL;DR

The work tackles LLM inference at the wireless network edge by jointly optimizing batch scheduling and the allocation of communication and computation resources. It formulates a throughput-maximization problem subject to memory, latency, and accuracy constraints, and solves it with an optimal Depth-First Tree-Searching algorithm with online tree-Pruning (DFTSP). In simulations on multiple open-source LLMs and quantization schemes, DFTSP achieves higher throughput than batching benchmarks such as static batching and reduces time complexity by over 45% relative to brute-force search. The study also introduces a perplexity-differential metric to balance quantization-induced accuracy loss against latency and throughput benefits, guiding practical deployment choices.

Abstract

Generative Artificial Intelligence (GAI) is taking the world by storm with its unparalleled content creation ability. Large Language Models (LLMs) are at the forefront of this movement. However, the significant resource demands of LLMs often require cloud hosting, which raises issues regarding privacy, latency, and usage limitations. Although edge intelligence has long been utilized to solve these challenges by enabling real-time AI computation on ubiquitous edge resources close to data sources, most research has focused on traditional AI models and has left a gap in addressing the unique characteristics of LLM inference, such as considerable model size, auto-regressive processes, and self-attention mechanisms. In this paper, we present an edge intelligence optimization problem tailored for LLM inference. Specifically, with the deployment of the batching technique and model quantization on resource-limited edge devices, we formulate an inference model for transformer decoder-based LLMs. Furthermore, our approach aims to maximize the inference throughput via batch scheduling and joint allocation of communication and computation resources, while also considering edge resource constraints and varying user requirements of latency and accuracy. To address this NP-hard problem, we develop an optimal Depth-First Tree-Searching algorithm with online tree-Pruning (DFTSP) that operates within a feasible time complexity. Simulation results indicate that DFTSP surpasses other batching benchmarks in throughput across diverse user settings and quantization techniques, and it reduces time complexity by over 45% compared to the brute-force searching method.
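
To make the idea of depth-first tree search with pruning concrete, the following is a minimal generic sketch, not the paper's Algorithm 1. It assumes a simplified setting in which each request has a fixed memory and uplink-bandwidth demand, the edge node has one cap for each, and throughput is the number of requests admitted into a batch. The names `Request`, `select_batch`, `mem_cap`, and `bw_cap` are hypothetical; latency and accuracy constraints are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Request:
    """Hypothetical request model; fields are illustrative, not the paper's notation."""
    name: str
    memory: float     # memory the request would occupy in the batch
    bandwidth: float  # uplink bandwidth the request would need


def select_batch(requests: List[Request], mem_cap: float, bw_cap: float) -> List[Request]:
    """Return a largest feasible subset using depth-first search with pruning."""
    best: List[Request] = []

    def dfs(i: int, chosen: List[Request], mem: float, bw: float) -> None:
        nonlocal best
        # Every node reached is feasible, so it may improve the incumbent.
        if len(chosen) > len(best):
            best = list(chosen)
        # Prune: even admitting every remaining request cannot beat the incumbent.
        # This also stops the recursion once i == len(requests).
        if len(chosen) + (len(requests) - i) <= len(best):
            return
        r = requests[i]
        # Branch 1: admit request i, but only if both caps still hold
        # (infeasible subtrees are never entered).
        if mem + r.memory <= mem_cap and bw + r.bandwidth <= bw_cap:
            chosen.append(r)
            dfs(i + 1, chosen, mem + r.memory, bw + r.bandwidth)
            chosen.pop()
        # Branch 2: skip request i.
        dfs(i + 1, chosen, mem, bw)

    dfs(0, [], 0.0, 0.0)
    return best


# Illustrative data only; the numbers are made up for this example.
demo = [
    Request("A", memory=4, bandwidth=3),
    Request("B", memory=3, bandwidth=3),
    Request("C", memory=5, bandwidth=5),
    Request("D", memory=2, bandwidth=2),
]
picked = select_batch(demo, mem_cap=10, bw_cap=8)
print([r.name for r in picked])  # ['A', 'B', 'D']
```

The two pruning rules (skip infeasible branches, and stop when the remaining requests cannot improve on the best batch found so far) are what keep such a search from visiting every subset as brute force would. The paper's formulation is richer than this sketch, so it should be read only as an illustration of the search pattern.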

Paper Structure

This paper contains 11 sections, 15 equations, 6 figures, 3 tables, and 1 algorithm.

Figures (6)

  • Figure 1: The wireless edge intelligence network and workflows for LLM inference. An LLM is deployed with model quantization.
  • Figure 2: Timeline for LLM edge inference.
  • Figure 3: LLM inference procedure.
  • Figure 4: An example of Algorithm 1. $z=6$, $d=10$, $N=3$, $\left|\mathcal{F}_{N_1}\right|=\left|\mathcal{F}_{N_2}\right|=4, \left|\mathcal{F}_{N_3}\right|=2$. All paths meet the memory and latency constraints. Inside each dotted circle, the number represents cumulative uplink bandwidth of requests in $\mathcal{S}_k$.
  • Figure 5: Throughput under different batching schemes.
  • ...and 1 more figure