PartialLoading: User Scheduling and Bandwidth Allocation for Parameter-sharing Edge Inference

Guanqiao Qu, Qian Chen, Xianhao Chen, Kaibin Huang, Yuguang Fang

TL;DR

This work reduces end-to-end latency in wireless edge inference by exploiting parameter sharing across AI models to avoid redundant loading. It formulates a joint user scheduling and bandwidth allocation problem, decouples it into a scheduling problem and a closed-form bandwidth policy, and solves a backbone-sharing special case optimally with dynamic programming. For the general case, it proposes a greedy heuristic. Both algorithms aim to maximize task throughput under GPU-memory, bandwidth, and latency constraints. Simulation results show substantial throughput gains over scheduling that does not exploit parameter sharing, highlighting the practical impact of strategic model loading and batch scheduling in edge AI deployments.

Abstract

By provisioning inference offloading services, edge inference drives the rapid growth of AI applications at the network edge. However, reducing inference latency remains a significant challenge. To address this issue, we develop a parameter-sharing AI model loading (PartialLoading) framework for multi-user edge inference, which exploits two key insights: 1) the majority of latency arises from loading AI models into server GPU memory, and 2) different AI models can share a significant number of parameters, for which redundant loading should be avoided. To this end, we formulate a joint multi-user scheduling and spectrum bandwidth allocation problem to maximize task throughput by exploiting shared parameter blocks across models. The intuition is to judiciously schedule user requests so that shared parameter blocks are reused between consecutively loaded models, thereby reducing model loading time substantially. To facilitate solution finding, we decouple the problem into two sub-problems, i.e., user scheduling and bandwidth allocation, and show that solving them sequentially yields the solution to the original problem. Because the problem is NP-hard, we first study an important special case called the "backbone-sharing" case and design a dynamic programming-based algorithm that obtains the optimal solution in polynomial time. For the general case, we propose a greedy heuristic that obtains a sub-optimal solution efficiently. Simulation results demonstrate that the proposed framework significantly improves task throughput under deadline constraints compared with user scheduling that does not exploit parameter sharing.
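The intuition that the order of consecutively loaded models determines how much loading can be skipped can be made concrete with a small sketch. This is not the paper's algorithm: the block names, sizes, and models below are hypothetical, only the previously loaded model is assumed to stay in GPU memory, and deadlines, batching, and bandwidth are ignored. The ordering rule is a plain nearest-neighbor greedy used purely for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical parameter blocks: block id -> size in MB.
BLOCK_MB: Dict[str, float] = {
    "b1": 40.0, "b2": 40.0, "h1": 10.0, "h2": 12.0, "c1": 60.0,
}

# Hypothetical models, each described by the set of blocks it uses.
MODELS: Dict[str, FrozenSet[str]] = {
    "model_1": frozenset({"b1", "b2", "h1"}),
    "model_2": frozenset({"b1", "b2", "h2"}),
    "model_3": frozenset({"c1", "h1"}),
}


def incremental_load_mb(prev: FrozenSet[str], nxt: FrozenSet[str]) -> float:
    """MB that must be loaded for `nxt` when the blocks in `prev` are resident."""
    return sum(BLOCK_MB[b] for b in nxt - prev)


def total_load_mb(order: Iterable[str]) -> float:
    """Total MB loaded when models are loaded one after another in `order`."""
    resident: FrozenSet[str] = frozenset()
    total = 0.0
    for name in order:
        total += incremental_load_mb(resident, MODELS[name])
        resident = MODELS[name]  # assumption: only the latest model stays resident
    return total


def greedy_order(names: Iterable[str]) -> List[str]:
    """Repeatedly pick the model that is cheapest to load next (illustrative only)."""
    remaining = sorted(names)  # sorted for a deterministic tie-break
    resident: FrozenSet[str] = frozenset()
    order: List[str] = []
    while remaining:
        best = min(remaining, key=lambda n: incremental_load_mb(resident, MODELS[n]))
        remaining.remove(best)
        order.append(best)
        resident = MODELS[best]
    return order


if __name__ == "__main__":
    no_sharing = sum(sum(BLOCK_MB[b] for b in blocks) for blocks in MODELS.values())
    order = greedy_order(MODELS)
    print(order, total_load_mb(order), "MB vs", no_sharing, "MB without sharing")
```

With these made-up numbers, a sharing-aware order loads 162 MB instead of the 252 MB needed when every model is loaded in full, while a poor order such as `model_1, model_3, model_2` still loads 242 MB.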

Paper Structure

This paper contains 34 sections, 7 theorems, 33 equations, 9 figures, 3 algorithms.

Key Result

Proposition 1

Given any user scheduling decision in batch $n$, the corresponding optimal spectrum bandwidth allocation in batch $n$ can be obtained by minimizing $t_{n}^{\text{up}}$. For given ${\bf{X}}$, the minimum value of $t_{n}^{\text{up}}$ is and the corresponding optimal spectrum bandwidth allocation under ${\bf{X}}$ is
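To illustrate why minimizing $t_{n}^{\text{up}}$ pins down the bandwidth allocation, the sketch below works through a simplified model that is an assumption of this example, not the paper's expressions: each user's uplink rate is taken to be linear in its allocated bandwidth (a fixed spectral efficiency in bit/s/Hz), and $t_{n}^{\text{up}}$ is taken to be the longest upload time among the users scheduled in the batch. Under those assumptions, the maximum upload time is minimized when all users finish uploading at the same instant, which gives a closed form. The function and parameter names are hypothetical.

```python
from typing import List, Tuple


def equalizing_allocation(
    data_bits: List[float],
    spectral_eff: List[float],
    total_bw_hz: float,
) -> Tuple[float, List[float]]:
    """Split `total_bw_hz` so that every user finishes uploading at the same time.

    Assumed model (not taken from the paper): user k uploads data_bits[k] bits at
    rate bw[k] * spectral_eff[k] bit/s, and the batch upload time is the maximum
    over users. If any user finished early, moving part of its bandwidth to the
    slowest user would lower the maximum, so the optimum equalizes finish times.
    """
    # Bandwidth-time product each user needs: bits / (bit/s/Hz) = Hz * s.
    demand = [d / e for d, e in zip(data_bits, spectral_eff)]
    total_demand = sum(demand)
    t_up = total_demand / total_bw_hz
    bw = [total_bw_hz * w / total_demand for w in demand]
    return t_up, bw


if __name__ == "__main__":
    # Two hypothetical users: 8 Mbit at 4 bit/s/Hz and 4 Mbit at 2 bit/s/Hz, 10 MHz total.
    t_up, bw = equalizing_allocation([8e6, 4e6], [4.0, 2.0], 10e6)
    print(t_up, bw)
```

In this example both users need the same bandwidth-time product, so each receives 5 MHz and the batch upload takes 0.4 s. With a rate model in which noise power also scales with bandwidth, the equal-finish-time argument still applies, but the allocation would generally have to be found numerically.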

Figures (9)

  • Figure 1: Total latency across various model structures. Model loading refers to the process of loading an AI model to GPU memory, while inference computing includes data tensorization and batching, moving data to GPU memory, and forward propagation. The models are from the ResNet family (He et al., 2016), with a batch size of 32, evaluated on the CIFAR-10 dataset (Krizhevsky, 2009). Inference is executed on a Linux server equipped with an Intel Core i9-13900K CPU, a GeForce RTX 4090 GPU with 24 GB GPU memory, a Toshiba 8 TB SATA3 HDD, and two Kingston 32 GB DDR5 RAM modules.
  • Figure 2: Inference accuracy vs. the number of frozen bottom layers in fine-tuned ResNet-50 models (He et al., 2016), where the ResNet-50 is first pre-trained on CIFAR-100 (Krizhevsky, 2009) and then fine-tuned on CIFAR-10 (Krizhevsky, 2009). Tasks 1 and 2 correspond to the classification of labels 0-4 and 5-9 in CIFAR-10, respectively. This example shows that different downstream models can share a significant proportion of layers without performance degradation compared with full-parameter fine-tuning (0 frozen layers).
  • Figure 3: Our considered single-edge multi-user scenario, where parameter blocks can be shared among AI models hosted on the server. In this figure, the first two layers (in orange) are shared between model 1 and model 2.
  • Figure 4: Workflow and timeline of an inference batch. We use object classification/detection applications as an illustrative example. Mobile devices offload the captured images to the edge server and request model $i$ in batch $n$. Since model $i$ shares the first two layers (in orange) with model $i-1$, which has been loaded in the previous batch (batch $n-1$), the edge server only loads the last two layers (in green) of model $i$ into the GPU memory for inference. The main goal of this paper is to schedule users into a sequence of batches to optimally leverage the parameters shared across models, thereby enhancing task throughput under latency constraints.
  • Figure 5: An illustrative example of backbone sharing within two model clusters, $m$ and $m'$, in the special case. Within each model cluster, models share a backbone or a subset of its bottom layers. In cluster $m$, the first layer of model 1 and the first three layers of models 2 and 3 come from the bottom layers of backbone $\mathcal{W}_{m}$. In cluster $m'$, models 4 and 5 share the entire backbone $\mathcal{W}_{m'}$.
  • ...and 4 more figures

Theorems & Definitions (19)

  • Definition 1
  • Proposition 1
  • proof
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Remark 1
  • Definition 2
  • Theorem 3
  • ...and 9 more