Collaborative Inference for Large Models with Task Offloading and Early Exiting

Zuan Xie, Yang Xu, Hongli Xu, Yunming Liao, Zhiyuan Yao

TL;DR

This work tackles running large-scale models at the network edge by combining model partitioning and early exiting to enable collaborative inference across multiple edge servers. The authors develop DTO-EE, a distributed framework that jointly optimizes the offloading strategy and exit confidence thresholds, grounded in a convex optimization formulation with an exterior-point penalty and supported by two subroutines, DTO-R and DTO-O, for receivers and offloaders respectively. They prove convergence and demonstrate through simulations that DTO-EE reduces average task delay by 21-41% and improves inference accuracy by 1-4% across heterogeneous and dynamic edge environments, outperforming several baselines. The approach offers a scalable, real-time solution for deploying large DL models at the edge, with significant implications for latency-sensitive 5G MEC applications.

Abstract

In 5G smart cities, edge computing provides nearby computing services for end devices, and large-scale models (e.g., GPT and LLaMA) can be deployed at the network edge to improve service quality. However, limited memory and computing capacity make it difficult to run these models on a single edge node. To meet the resource constraints, a large-scale model can be partitioned into multiple sub-models and deployed across multiple edge nodes, and tasks are then offloaded to these nodes for collaborative inference. Additionally, we incorporate an early exit mechanism to further accelerate inference. However, system heterogeneity and the dynamic environment significantly affect inference efficiency. To address these challenges, we theoretically analyze the coupling between the task offloading strategy and the confidence thresholds, and develop a distributed algorithm, termed DTO-EE, based on this coupling and convex optimization. DTO-EE enables each edge node to jointly optimize its offloading strategy and confidence threshold, so as to achieve a good trade-off between response delay and inference accuracy. Experimental results show that DTO-EE reduces the average response delay by 21%-41% and improves inference accuracy by 1%-4% compared to the baselines.
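
To make the early exit idea concrete, the following is a minimal sketch of the exit rule only, not the DTO-EE algorithm itself. It assumes each sub-model has an exit head that returns class probabilities and that confidence is measured as the largest probability, which is a common choice but an assumption here. The names `Stage`, `infer_with_early_exit`, and `make_stage` are hypothetical, and the thresholds are fixed by hand rather than optimized.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stage type (not from the paper): a sub-model that maps input
# features to (output features, class probabilities from its exit head).
Stage = Callable[[List[float]], Tuple[List[float], List[float]]]


def infer_with_early_exit(
    x: List[float],
    stages: Sequence[Stage],
    thresholds: Sequence[float],
) -> Tuple[int, int, float]:
    """Run sub-models in order and return (label, exit_index, confidence).

    The task leaves at the first exit whose top probability reaches that
    exit's threshold; the final stage always produces an answer.
    """
    if not stages or len(stages) != len(thresholds):
        raise ValueError("need at least one stage and one threshold per stage")
    features = x
    for idx, (stage, tau) in enumerate(zip(stages, thresholds)):
        features, probs = stage(features)
        confidence = max(probs)  # assumed confidence measure
        if confidence >= tau or idx == len(stages) - 1:
            return probs.index(confidence), idx, confidence
    raise AssertionError("unreachable: the last stage always returns")


def make_stage(probs: List[float]) -> Stage:
    """Build a toy stage that passes features through and emits fixed probabilities."""
    def stage(features: List[float]) -> Tuple[List[float], List[float]]:
        return features, probs
    return stage


# Three toy sub-models, imagined as living on three different edge nodes.
stages = [make_stage([0.6, 0.4]), make_stage([0.2, 0.8]), make_stage([0.1, 0.9])]

result_low = infer_with_early_exit([1.0], stages, [0.5, 0.5, 0.0])
result_high = infer_with_early_exit([1.0], stages, [0.7, 0.7, 0.0])
result_strict = infer_with_early_exit([1.0], stages, [0.95, 0.95, 0.95])
```

With low thresholds the task exits at the first sub-model, which saves computation on later nodes but accepts a less confident prediction; raising the thresholds pushes tasks deeper into the model. This is the delay versus accuracy trade-off that the abstract describes.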

Paper Structure

This paper contains 19 sections, 1 theorem, 26 equations, 9 figures, 2 tables, 3 algorithms.

Key Result

Lemma 1

If $P^{t} \neq P^{*}$, the update $\Gamma$ gives a gradient descent direction of $R(P^{t})$ at the point $P^{t}$, where $\nabla$ is the gradient operator and $\langle a, b \rangle$ denotes the inner product of vectors $a$ and $b$.
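
The lemma's inequality itself is not reproduced in this summary. As a hedged reading, the standard descent-direction condition from convex optimization would take the form below, under the assumption that $\Gamma(P^{t})$ denotes the updated strategy, so that $\Gamma(P^{t}) - P^{t}$ is the update direction. The paper's exact expression may differ.

```latex
\[
\left\langle \nabla R(P^{t}),\; \Gamma(P^{t}) - P^{t} \right\rangle < 0 .
\]
```

In words, for a differentiable $R$ this means that a sufficiently small step from $P^{t}$ toward $\Gamma(P^{t})$ decreases $R$.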

Figures (9)

  • Figure 1: Illustration of collaborative inference of a large model with multiple edge nodes.
  • Figure 2: The overview of DTO-EE.
  • Figure 3: Inference performance of algorithms given different task arrival rates for ResNet101 on ImageNet.
  • Figure 4: Inference performance of algorithms given different task arrival rates for BERT on Tnews.
  • Figure 5: Inference performance with varying average computing resource for ResNet101 on ImageNet.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Lemma 1
  • proof