Multi-DNN Inference of Sparse Models on Edge SoCs

Jiawei Luo, Di Wu, Simon Dobson, Blesson Varghese

TL;DR

This work introduces model stitching for multi-DNN inference systems, which creates model variants by recombining subgraphs from sparse models without re-training, and presents a demonstrator system, SparseLoom, that shows model stitching can be deployed to SoCs.

Abstract

Modern edge applications increasingly require multi-DNN inference systems to execute tasks on heterogeneous processors, gaining performance both from concurrent execution and from matching each model to the most suitable accelerator. However, existing systems support only a single model (or a few sparse variants) per task, which limits the efficiency of this matching and results in high Service Level Objective (SLO) violation rates. We introduce model stitching for multi-DNN inference systems, which creates model variants by recombining subgraphs from sparse models without re-training. We present a demonstrator system, SparseLoom, that shows model stitching can be deployed to SoCs. We show experimentally that SparseLoom reduces SLO violation rates by up to 74%, improves throughput by up to 2.31x, and lowers memory overhead by an average of 28% compared to state-of-the-art multi-DNN inference systems.
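To make the idea of recombining subgraphs concrete, here is a minimal sketch of how the stitched variant space could be enumerated. It assumes a hypothetical zoo of three sparse models (dense, pruned, quantized), each split into the same three subgraph positions, and it assumes any subgraph can be joined to any other at the split points. The names `SOURCES`, `NUM_SUBGRAPHS`, and `enumerate_stitched_variants` are illustrative and are not taken from SparseLoom.

```python
from itertools import product

# Hypothetical sparse model zoo for one task (names are illustrative only).
SOURCES = ("dense", "pruned", "quantized")
# Assumption: every model is split at the same points into 3 subgraphs.
NUM_SUBGRAPHS = 3


def enumerate_stitched_variants(sources, num_subgraphs):
    """Return every way to pick one source model per subgraph position.

    Each variant is a tuple such as ("dense", "pruned", "quantized"),
    meaning S1 comes from the dense model, S2 from the pruned model,
    and S3 from the quantized model.
    """
    return list(product(sources, repeat=num_subgraphs))


variants = enumerate_stitched_variants(SOURCES, NUM_SUBGRAPHS)

# Variants that take all subgraphs from one model are the original models.
originals = [v for v in variants if len(set(v)) == 1]

print(f"{len(variants)} variants in total, {len(originals)} of them original")
```

Under these assumptions, three models with three subgraphs each give 3^3 = 27 candidate variants, of which 3 are the original models. A real system would also have to check that tensor shapes and data types are compatible at each split point, which this sketch does not model.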

Paper Structure

This paper contains 21 sections, 7 equations, 16 figures, 7 tables, and 2 algorithms.

Figures (16)

  • Figure 1: An AR use case of multi-DNN inference on a heterogeneous edge SoC. The application executes four tasks in parallel on the CPU, GPU, and NPU. Each task is provisioned with a sparse model zoo to meet varying SLO requirements.
  • Figure 2: Model stitching: given three sparse models (dense, pruned, quantized) each split into subgraphs S1–S3, Stitched Variant 1 combines S1 from the Dense Model (blue), S2 from the Pruned Variant (purple), and S3 from the Quantized Variant (orange).
  • Figure 3: SLO violations with vs. without stitching. Stitching substantially reduces the SLO violation rate. The x-axis represents different SLO configurations, where a larger index (e.g., C8) corresponds to a more challenging SLO with stricter accuracy and latency requirements.
  • Figure 4: Histogram of ResNet101 variants in the accuracy–latency space; cell counts show density, and red-edged cells indicate the Pareto frontier. Model stitching expands the variant space and achieves a better accuracy–latency frontier than the original sparse models.
  • Figure 5: (a) Latency breakdown, including compilation, loading, and inference. (b) Memory breakdown, including active variants, preloaded variants, and others.
  • ...and 11 more figures
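
The Pareto frontier mentioned in the Figure 4 caption can be illustrated with a short sketch. A variant is on the frontier if no other variant is at least as accurate and at least as fast while being strictly better in one of the two. The accuracy and latency numbers below are made up purely for illustration and are not results from the paper; `pareto_frontier` is a hypothetical helper, not part of SparseLoom.

```python
def pareto_frontier(variants):
    """Return names of variants not dominated in (accuracy, latency).

    `variants` is a list of (name, accuracy, latency_ms) tuples.
    Higher accuracy is better; lower latency is better.
    """
    frontier = []
    for name, acc, lat in variants:
        dominated = any(
            other_acc >= acc
            and other_lat <= lat
            and (other_acc > acc or other_lat < lat)
            for _, other_acc, other_lat in variants
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Made-up example values (NOT measurements from the paper).
example = [
    ("dense", 0.80, 30.0),
    ("pruned", 0.76, 22.0),
    ("quantized", 0.74, 15.0),
    ("stitched_a", 0.78, 20.0),  # more accurate and faster than "pruned"
    ("stitched_b", 0.73, 25.0),  # less accurate and slower than "quantized"
]

frontier = pareto_frontier(example)
print(frontier)
```

In this toy data, the hypothetical `stitched_a` displaces `pruned` from the frontier, which is the kind of effect the caption describes: a larger variant space can yield a better accuracy-latency trade-off than the original sparse models alone.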