Multi-DNN Inference of Sparse Models on Edge SoCs

Jiawei Luo; Di Wu; Simon Dobson; Blesson Varghese

TL;DR

This work introduces model stitching for multi-DNN inference systems, which creates model variants by recombining subgraphs from sparse models without re-training, and presents a demonstrator system, SparseLoom, that shows model stitching can be deployed to SoCs.

Abstract

Modern edge applications increasingly require multi-DNN inference systems to execute tasks on heterogeneous processors, gaining performance both from concurrent execution and from matching each model to the best-suited accelerator. However, existing systems support only a single model (or a few sparse variants) per task, which impedes the efficiency of this matching and results in high Service Level Objective (SLO) violation rates. We introduce model stitching for multi-DNN inference systems, which creates model variants by recombining subgraphs from sparse models without re-training. We present a demonstrator system, SparseLoom, that shows model stitching can be deployed to SoCs. We show experimentally that SparseLoom reduces SLO violation rates by up to 74%, improves throughput by up to 2.31x, and lowers memory overhead by an average of 28% compared to state-of-the-art multi-DNN inference systems.
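To make the recombination idea concrete, the following is a minimal Python sketch and not the SparseLoom implementation. It assumes, as in the Figure 2 example, a zoo of three sparse models (dense, pruned, quantized) that are each split into three subgraphs at matching cut points, and it only enumerates which source model supplies each subgraph position. The names `SOURCES`, `NUM_SUBGRAPHS`, `enumerate_stitched_variants`, and `describe` are hypothetical and chosen for illustration.

```python
from itertools import product

# Hypothetical sparse model zoo for one task. Assumption: every model is
# split into the same number of subgraphs at matching cut points, so any
# subgraph can be swapped for the one at the same position in another model.
SOURCES = ["dense", "pruned", "quantized"]
NUM_SUBGRAPHS = 3


def enumerate_stitched_variants(sources, num_subgraphs):
    """Return every way to pick one source model per subgraph position."""
    return [tuple(choice) for choice in product(sources, repeat=num_subgraphs)]


def describe(variant):
    """Render a variant as 'S1<-dense, S2<-pruned, ...' for readability."""
    return ", ".join(f"S{i + 1}<-{src}" for i, src in enumerate(variant))


variants = enumerate_stitched_variants(SOURCES, NUM_SUBGRAPHS)
# Variants that mix at least two source models; the rest are the originals.
stitched_only = [v for v in variants if len(set(v)) > 1]

print(len(variants))       # 27 = 3 sources ** 3 positions
print(len(stitched_only))  # 24 = 27 minus the 3 unmixed originals
print(describe(("dense", "pruned", "quantized")))
```

Under these assumptions, three source models yield 27 candidate variants per task, 24 of which are new combinations. The sketch says nothing about the accuracy or latency of any combination, nor about which combinations are valid in practice.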

Paper Structure (21 sections, 7 equations, 16 figures, 7 tables, 2 algorithms)

This paper contains 21 sections, 7 equations, 16 figures, 7 tables, 2 algorithms.

Introduction
Motivation and Challenges
Definition, Scope and Benefit of Model Stitching
Challenges of Model Stitching
Design of SparseLoom
Stitching subgraphs from sparse variants
Estimating Accuracy and Latency of Stitched Variants
Optimizing Processor Placement Order and Selecting Stitched Variants
Preloading Subgraphs of Variants
Implementation
Evaluation
Experimental Setup
End-to-end Performance
Evaluation of Individual Modules
Discussion
...and 6 more sections

Figures (16)

Figure 1: An AR use case of multi-DNN inference on a heterogeneous edge SoC. The application executes four tasks in parallel on the CPU, GPU, and NPU. Each task is provisioned with a sparse model zoo to meet varying SLO requirements.
Figure 2: Model stitching: given three sparse models (dense, pruned, quantized) each split into subgraphs S1–S3, Stitched Variant 1 combines S1 from the Dense Model (blue), S2 from the Pruned Variant (purple), and S3 from the Quantized Variant (orange).
Figure 3: SLO violations with vs. without stitching. Stitching substantially reduces the SLO violation rate. The x-axis represents different SLO configurations, where a larger index (e.g., C8) corresponds to a more challenging SLO with stricter accuracy and latency requirements.
Figure 4: Histogram of ResNet101 variants in the accuracy–latency space; cell counts show density, and red-edged cells indicate the Pareto frontier. Model stitching expands the variant space and achieves a better accuracy–latency frontier than the original sparse models.
Figure 5: (a) Latency breakdown, including compilation, loading, and inference. (b) Memory breakdown, including active variants, preloaded variants, and others.
...and 11 more figures
