Priority-Aware Model-Distributed Inference at Edge Networks

Teng Li, Hulya Seferoglu

TL;DR

PA-MDI tackles the challenge of coordinating partitioned DNN inference across multiple heterogeneous edge devices with different data-source priorities. By formulating a convex, decomposable optimization that maximizes aggregated accuracy while minimizing latency, and by implementing a practical offloading policy with an RTC/CTC contention mechanism, PA-MDI achieves substantial reductions in inference time for time-sensitive sources. The approach is validated on diverse platforms, including NVIDIA Jetson devices and the Colosseum testbed with ResNet-50, ResNet-56, and GPT-2, demonstrating strong improvements over AR-MDI, MS-MDI, and Local baselines, especially in multi-source and multi-hop settings. These results indicate that priority-aware model distribution is a viable path to scalable, low-latency edge inference in real-world, multimodal environments.

Abstract

Distributed inference techniques can be broadly classified into data-distributed and model-distributed schemes. In data-distributed inference (DDI), each worker carries the entire Machine Learning (ML) model but processes only a subset of the data. However, feeding the data to workers results in high communication costs, especially when the data is large. An emerging paradigm is model-distributed inference (MDI), where each worker carries only a subset of ML layers. In MDI, a source device that has data processes a few layers of the ML model and sends the output to a neighboring device, i.e., offloads the rest of the layers. This process ends when all layers are processed in a distributed manner. In this paper, we investigate the design and development of MDI when multiple data sources co-exist. We consider that each data source has a different importance and, hence, a priority. We formulate and solve a priority-aware model allocation optimization problem. Based on the structure of the optimal solution, we design a practical Priority-Aware Model-Distributed Inference (PA-MDI) algorithm that determines model allocation and distribution over devices by taking into account the priorities of different sources. Experiments were conducted on a real-life testbed of NVIDIA Jetson Xavier and Nano edge devices as well as in the Colosseum testbed with ResNet-50, ResNet-56, and GPT-2 models. The experimental results show that PA-MDI performs priority-aware model allocation successfully while reducing the inference time compared to baselines.
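
The basic MDI flow described in the abstract (a source runs the first few layers, then hands the intermediate output to the next device, until all layers are processed) can be illustrated with a minimal sketch. This is not the PA-MDI algorithm and involves no priorities or networking: the toy scalar "layers", the chunk sizes, and the function names `partition_layers`, `run_chunk`, and `model_distributed_inference` are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a DNN layer: a function on a single float.
Layer = Callable[[float], float]


def partition_layers(layers: Sequence[Layer], sizes: Sequence[int]) -> List[List[Layer]]:
    """Split an ordered list of layers into consecutive chunks, one per worker."""
    if sum(sizes) != len(layers):
        raise ValueError("chunk sizes must cover every layer exactly once")
    chunks: List[List[Layer]] = []
    start = 0
    for size in sizes:
        chunks.append(list(layers[start:start + size]))
        start += size
    return chunks


def run_chunk(chunk: Sequence[Layer], activation: float) -> float:
    """Apply one worker's layers, in order, to the incoming activation."""
    for layer in chunk:
        activation = layer(activation)
    return activation


def model_distributed_inference(chunks: Sequence[Sequence[Layer]], x: float) -> float:
    """Pass the activation from worker to worker until all layers have run."""
    activation = x
    for chunk in chunks:  # each iteration stands in for one offloading hop
        activation = run_chunk(chunk, activation)
    return activation


# Toy four-layer "model" split across three hypothetical workers.
layers: List[Layer] = [
    lambda v: v + 1.0,
    lambda v: v * 2.0,
    lambda v: v - 3.0,
    lambda v: v * v,
]
chunks = partition_layers(layers, [1, 2, 1])
result = model_distributed_inference(chunks, 2.0)
local_result = run_chunk(layers, 2.0)  # same model run on a single device
print(result, local_result)
```

In this sketch the distributed and single-device runs give the same output, which reflects the point of MDI: only the placement of layers changes, not the computation. Deciding how many layers each worker receives when several prioritized sources compete is the allocation problem the paper addresses, and it is deliberately left out here.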

Paper Structure

This paper contains 18 sections, 8 equations, 10 figures, 2 algorithms.

Figures (10)

  • Figure 1: Model-distributed inference and model parallelism.
  • Figure 2: Model-distributed inference for multiple data sources. One of the data sources, either image classification or audio analytics, may have a higher priority in terms of processing their data.
  • Figure 3: Worker $A$, who hosts "Non-Time-Sensitive" data, has dataset CIFAR-10 (224x224) which is processed by ResNet-50, while Worker $D$, who hosts "Time-Sensitive" data, has dataset CIFAR-10 (32x32) which is processed by ResNet-56.
  • Figure 4: Worker $A$ has a small dataset (CIFAR-10 (32x32), which is processed by distributed ResNet-56) while Worker $D$ has a larger dataset (CIFAR-10 (224x224), which is processed by distributed ResNet-50). Worker $A$ hosts "Non-Time-Sensitive" data, while Worker $D$ hosts "Time-Sensitive" data.
  • Figure 5: Both Workers $A$ and $D$ have the larger dataset (CIFAR-10 (224x224), which is processed by ResNet-50). Similar to the previous scenarios, Worker $A$ hosts "Non-Time-Sensitive" data, while Worker $D$ hosts "Time-Sensitive" data.
  • ...and 5 more figures