PICE: A Semantic-Driven Progressive Inference System for LLM Serving in Cloud-Edge Networks

Huiyou Zhan; Xuan Zhang; Haisheng Tan; Han Tian; Dongping Yong; Junyang Zhang; Xiang-Yang Li

TL;DR

PICE is an LLM serving system with semantic-level cloud-edge collaboration that improves inference throughput and quality through dynamic inference task scheduling, ensemble learning, and parallel edge inference.

Abstract

Large language models (LLMs), while driving a new wave of interactive AI applications across numerous domains, suffer from high inference costs and heavy cloud dependency. Motivated by the redundancy phenomenon in linguistics, we propose a progressive inference paradigm over cloud and edge: an LLM in the cloud first generates a sketch of the answer, and small language models (SLMs) at the edge then extend the sketch in parallel to fill in the details. Progressive inference can improve throughput and reduce inference latency, but it faces key implementation challenges, including decreased response quality from SLMs, a tradeoff between the brevity and comprehensiveness of sketches, and increased latency caused by network transmission and edge inference. In this work, we propose and implement PICE, an LLM serving system with semantic-level cloud-edge collaboration that enhances inference throughput and quality through dynamic inference task scheduling, ensemble learning, and parallel edge inference. Extensive testbed experiments show that our approach achieves a $1.5$-$2\times$ throughput improvement and up to 43% latency reduction, while also potentially improving response quality compared to SOTA systems.
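The sketch-then-extend flow described in the abstract can be illustrated with a toy Python example. This is only a minimal sketch under stated assumptions, not PICE's implementation: `cloud_sketch`, `edge_expand`, and `progressive_inference` are hypothetical names, both models are replaced by string-building stubs, and a thread pool stands in for parallel edge devices. Scheduling, ensemble learning, and network transmission are not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


def cloud_sketch(query: str) -> list[str]:
    """Hypothetical stand-in for the cloud LLM: return a short outline."""
    # Assumption: a real system would call a large model here.
    return [f"point {i} about {query}" for i in range(1, 4)]


def edge_expand(point: str) -> str:
    """Hypothetical stand-in for an edge SLM: expand one sketch point."""
    # Assumption: a real system would call a small model here.
    return f"Expanded: {point}."


def progressive_inference(query: str, max_workers: int = 3) -> str:
    """Generate a sketch first, then expand its points in parallel."""
    sketch = cloud_sketch(query)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the sketch points.
        details = list(pool.map(edge_expand, sketch))
    return "\n".join(details)


answer = progressive_inference("edge caching")
print(answer)
```

Because each sketch point is expanded independently, the expansion step parallelizes across workers, while the cloud model only has to produce the short outline.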

Paper Structure (22 sections, 7 equations, 14 figures, 4 tables, 2 algorithms)

Introduction
Background and Motivation
LLM Services on the Cloud and Edge
Our idea: Cloud-edge Collaborative Progressive Inference
Challenges of Progressive Inference
PICE Overview
Key Design
Dynamic Scheduler
Optimization Objective
Cloud-side scheduling
Job dispatching
Edge-side scheduling
Execution Optimizer
Ensemble Learning
Model Fine-tuning
...and 7 more sections

Figures (14)

Figure 1: Existing inference paradigms vs. progressive inference.
Figure 2: The conditional probability and variances of the Qwen2.5-72B, Qwen2.5-7B, and Qwen2.5-1.5B models across different tokens. Lower variances indicate closer agreement among the models' output distributions.
Figure 3: The horizontal axis represents the max tokens of the LLM's response, while the vertical axis denotes the throughput of the serving system. The throughput is measured in the number of queries processed per minute (#queries/min).
Figure 4: The overview and workflow of PICE.
Figure 5: The design of PICE's fine-tuning component.
...and 9 more figures
