Recursive Offloading for LLM Serving in Multi-tier Networks

Zhiyuan Wu, Sheng Sun, Yuwei Wang, Min Liu, Bo Gao, Jinda Lu, Zheming Yang, Tian Wen

TL;DR

RecServe introduces a recursive offloading framework for LLM serving in a heterogeneous device-edge-cloud network, using task-specific confidence measures and a sliding-window quantile threshold to decide whether to process a task locally or offload it to a higher tier. A queue of historical confidence scores enables dynamic thresholding, allowing most tasks to complete at low tiers while only the most complex ones reach the cloud. The authors provide theoretical bounds on communication and computation costs and demonstrate on eight datasets that RecServe substantially reduces communication burden (over 50% in many cases) while maintaining competitive accuracy and BLEU scores compared to cloud-centric baselines. Ablation and robustness analyses show the method’s sensitivity to the quantile parameter, queue size, and cloud-model choice, and cover practical mechanisms for handling tier unavailability and budget-constrained serving. Overall, RecServe offers a scalable, adaptive framework for efficient LLM serving in multi-tier networks, aligning service quality with network and resource constraints.
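
The sliding-window quantile threshold described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `ConfidenceWindow`, the parameter names `max_size` and `beta`, the rule "offload when confidence falls below the beta-quantile of the window", and the behavior on an empty window are all assumptions made for illustration.

```python
from collections import deque


class ConfidenceWindow:
    """Fixed-size queue of recent confidence scores (hypothetical helper)."""

    def __init__(self, max_size: int, beta: float):
        if not 0.0 <= beta <= 1.0:
            raise ValueError("beta must lie in [0, 1]")
        # deque with maxlen drops the oldest score once the window is full
        self.scores = deque(maxlen=max_size)
        self.beta = beta

    def threshold(self) -> float:
        """Linearly interpolated beta-quantile of the scores in the window."""
        if not self.scores:
            # Assumption: with no history yet, nothing is offloaded.
            return float("-inf")
        ordered = sorted(self.scores)
        pos = self.beta * (len(ordered) - 1)  # fractional rank in the window
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        frac = pos - lo
        return ordered[lo] + frac * (ordered[hi] - ordered[lo])

    def should_offload(self, confidence: float) -> bool:
        """Decide against the current window, then record the new score."""
        decision = confidence < self.threshold()
        self.scores.append(confidence)
        return decision
```

Under this assumed rule, a small `beta` sends only the least confident fraction of recent tasks upward, and the threshold moves on its own as the distribution of confidence scores in the window shifts.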

Abstract

Heterogeneous device-edge-cloud computing infrastructures have become widely adopted by telecommunication operators and in Wide Area Networks (WANs), offering multi-tier computational support for emerging intelligent services. With the rapid proliferation of Large Language Model (LLM) services, efficiently coordinating inference tasks and reducing communication overhead within these multi-tier network architectures has become a critical deployment challenge. Existing LLM serving paradigms exhibit significant limitations: on-device deployment supports only lightweight LLMs due to hardware constraints, while cloud-centric deployment suffers from resource congestion and considerable prompt communication overhead caused by frequent service requests during peak periods. Although the model-cascading-based inference strategy adapts better to multi-tier networks, its reliance on fine-grained, manually adjusted thresholds makes it less responsive to dynamic network conditions and varying task complexities. To address these challenges, we propose RecServe, a recursive offloading framework tailored for LLM serving in multi-tier networks. RecServe integrates a task-specific hierarchical confidence evaluation mechanism that guides offloading decisions based on inferred task complexity in progressively scaled LLMs across device, edge, and cloud tiers. To further enable intelligent task routing across tiers, RecServe employs a sliding-window-based dynamic offloading strategy with quantile interpolation, enabling real-time tracking of historical confidence distributions and adaptive offloading threshold adjustments. Experiments on eight datasets demonstrate that RecServe outperforms CasServe in both service quality and communication efficiency, and reduces the communication burden by over 50% compared to centralized cloud-based serving.
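
The recursive offloading idea in the abstract can be sketched as a short recursive function. This is an illustrative sketch only, not the paper's algorithm: the `Tier` tuple layout, the `serve` function, and the use of a fixed per-tier threshold (the paper adapts thresholds dynamically) are assumptions introduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical tier description: (name, model, threshold).
# The model maps a task to (answer, confidence).
Tier = Tuple[str, Callable[[str], Tuple[str, float]], float]


def serve(task: str, tiers: List[Tier], level: int = 0) -> Tuple[str, str]:
    """Return (tier_name, answer) from the lowest tier confident enough to answer."""
    name, model, threshold = tiers[level]
    answer, confidence = model(task)
    is_top = level == len(tiers) - 1
    # The top tier has nowhere to offload to, so it always answers.
    if is_top or confidence >= threshold:
        return name, answer
    # Otherwise offload: hand the same task to the next tier up.
    return serve(task, tiers, level + 1)
```

With tiers ordered device, edge, cloud, each call either answers locally or passes the task one tier upward, so a task reaches the cloud only after every lower tier has reported low confidence.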

Paper Structure

This paper contains 42 sections, 45 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Schematic diagram of RecServe.
  • Figure 2: Visualization of precision vs communication burden for multi-tier serving methods across eight datasets.
  • Figure 3: Comparison of RecServe and ColServe with different offload configurations.
  • Figure 4: Effect of maximum queue capacity on RecServe’s inference accuracy ($\beta=0.1$) and communication burden.
  • Figure 5: Comparison of RecServe and ColServe with DeBERTa-large deployed on the cloud. For each method, the left bar shows the stacked communication burden across end, edge, and cloud tiers, while the right bar indicates the corresponding accuracy.
  • ...and 2 more figures