Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment

Jiaqi Wu, Simin Chen, Zehua Wang, Wei Chen, Zijian Tian, F. Richard Yu, Victor C. M. Leung

TL;DR

A co-design framework that jointly optimizes neural network architecture and deployment strategy at inference time for high throughput. It combines a dynamic model structure based on re-parameterization with a Roofline-based model partitioning strategy to better exploit the computational performance of edge devices.

Abstract

As the volume of image data grows, data-oriented cloud computing in Internet of Video Things (IoVT) systems encounters latency issues. Task-oriented edge computing addresses this by shifting data analysis to the edge. However, the limited computational power of edge devices makes it challenging to execute visual tasks. Existing methods struggle to balance high model performance with low resource consumption: lightweight neural networks often underperform, while device-specific models designed by Neural Architecture Search (NAS) fail to adapt to heterogeneous devices. To address these issues, we propose a novel co-design framework that optimizes neural network architecture and deployment strategies during inference for high throughput. Specifically, it implements a dynamic model structure based on re-parameterization, coupled with a Roofline-based model partitioning strategy, to enhance the computational performance of edge devices. We also employ a multi-objective co-optimization approach to balance throughput and accuracy. Additionally, we derive the mathematical consistency and convergence of the partitioned models. Experimental results demonstrate significant improvements in throughput (12.05% on MNIST, 18.83% on ImageNet) and superior classification accuracy compared to baseline algorithms. Our method achieves stable performance across different devices, underscoring its adaptability. Simulated experiments further confirm its efficacy in high-accuracy, real-time detection of small objects in IoVT systems.
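
The Roofline-based partitioning mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes the standard Roofline model (attainable performance is the smaller of peak compute and intensity times memory bandwidth) and a hypothetical two-stage pipeline whose cost is the slower of the two sub-models. The names `Device`, `attainable_flops`, `segment_time`, and `best_partition`, and the per-layer `(flops, bytes)` representation, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    peak_flops: float  # peak compute, FLOP/s
    bandwidth: float   # memory bandwidth, bytes/s

    @property
    def ridge_intensity(self) -> float:
        # Intensity (FLOP/byte) at which the device switches from
        # memory-bound to compute-bound.
        return self.peak_flops / self.bandwidth


def attainable_flops(device: Device, intensity: float) -> float:
    # Standard Roofline bound: min(peak compute, intensity * bandwidth).
    return min(device.peak_flops, intensity * device.bandwidth)


def bound_type(device: Device, intensity: float) -> str:
    if intensity < device.ridge_intensity:
        return "memory-bound"
    return "compute-bound"


def segment_time(device: Device, layers) -> float:
    # layers: list of (flops, bytes_moved) pairs; hypothetical cost model.
    flops = sum(f for f, _ in layers)
    moved = sum(b for _, b in layers)
    if flops == 0:
        return 0.0
    return flops / attainable_flops(device, flops / moved)


def best_partition(layers, dev1: Device, dev2: Device):
    # Try every split point k: layers[:k] run on dev1, layers[k:] on dev2.
    # Assumed objective: minimize the slower stage of the pipeline.
    best = None
    for k in range(1, len(layers)):
        t = max(segment_time(dev1, layers[:k]), segment_time(dev2, layers[k:]))
        if best is None or t < best[1]:
            best = (k, t)
    return best
```

For example, `best_partition([(100, 100), (100, 10), (100, 10)], Device(100, 10), Device(1000, 10))` evaluates both split points and returns the one whose slower stage is fastest. The exhaustive search is only practical for a small number of candidate partition points; the paper instead co-optimizes the partition point with the model structure.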

Paper Structure

This paper contains 20 sections, 27 equations, 11 figures, 7 tables, 3 algorithms.

Figures (11)

  • Figure 1: The comparison between centralized cloud computing and edge computing approach in the IoVT system. Centralized cloud computing is a typical data-driven design, requiring all data to be uploaded to a central server for inference. In contrast, edge computing is a task-oriented paradigm, where data analysis tasks are offloaded from the cloud to the edge, closer to the data collection points, in order to reduce system latency.
  • Figure 2: Dynamic model architecture of RepVGG. GoogLeNet uses convolutional layers with different kernel sizes to extract multi-scale features, but this also significantly increases the model's computational complexity. RepVGG, on the other hand, utilizes multi-branch structures during the training phase to extract rich features and merges these branches during the inference phase. This approach reduces the model size while effectively maintaining performance.
  • Figure 3: Overview of our method. At the system level, we employ a model partitioning method based on Roofline analysis, where sub-model 1 and sub-model 2 are deployed on the computing terminal and the edge server, respectively, according to the selected partition point. At the model level, we utilize a dynamic model structure that adapts the network architecture during the inference phase to better match the available computational resources. Multi-objective optimization, as shown in Eq. 11, adjusts both the partition point and the model structure to fully leverage the computational performance of the devices, thereby improving throughput while maintaining inference accuracy.
  • Figure 4: Roofline results of the partitioned model. $I_{mn}$ denotes the computational intensity of device $n$. The computational intensity of the model, $I_0$, is smaller than $I_{m2}$, which indicates that it is constrained by the memory bandwidth of Device 2 (MB constraints), and larger than $I_{m1}$, which indicates that it is constrained by the computational performance of Device 1 (CB constraints). The goal of partitioned deployment is that, after $I_0$ is partitioned, the computational intensities of the resulting sub-models, $I_1$ and $I_2$, make the best possible use of the maximum computational performance of their corresponding devices.
  • Figure 5: Comparison of real-time performance. "Our" denotes our method. In the boxplots, black dots indicate mean values and hollow circles denote outliers. The response time per image for each method is expressed in units of $\times 10^2$. Compared to the baseline algorithms, our approach achieves the highest throughput and the fastest response times, consistently reaching real-time inference performance even on large-scale ImageNet images.
  • ...and 6 more figures
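
The branch merging described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's implementation: it assumes a RepVGG-style block with a 3x3 branch, a 1x1 branch, and an identity branch, equal input and output channel counts, and stride 1. It omits batch normalization, which RepVGG folds into each branch's convolution before merging. The helper names `conv2d_same` and `merge_branches` are hypothetical.

```python
import numpy as np


def conv2d_same(x, w, b):
    """Stride-1 cross-correlation with zero padding that preserves H and W.

    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k; b: (C_out,).
    """
    c_out, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    h, wd = x.shape[1:]
    y = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel tap (i, j) over all input channels.
            y += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return y + b[:, None, None]


def merge_branches(w3, b3, w1, b1):
    """Fold the 3x3, 1x1, and identity branches into one 3x3 kernel and bias."""
    c_out, c_in = w3.shape[:2]
    assert c_out == c_in, "the identity branch needs matching channel counts"
    w = w3.copy()
    # A 1x1 kernel is a 3x3 kernel that is zero everywhere except the centre.
    w[:, :, 1, 1] += w1[:, :, 0, 0]
    # The identity map is a 3x3 kernel with 1 at the centre of channel c -> c.
    idx = np.arange(c_out)
    w[idx, idx, 1, 1] += 1.0
    return w, b3 + b1
```

Because convolution is linear, `conv2d_same(x, w3, b3) + conv2d_same(x, w1, b1) + x` and a single `conv2d_same(x, *merge_branches(w3, b3, w1, b1))` agree up to floating-point error, so the merged block performs one convolution at inference instead of three branch evaluations.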