SparseDVFS: Sparse-Aware DVFS for Energy-Efficient Edge Inference

Ziyang Zhang, Zheshun Wu, Jie Liu, Luca Mottola

Abstract

Deploying deep neural networks (DNNs) on power-sensitive edge devices is challenging. Dynamic Voltage and Frequency Scaling (DVFS) is widely used for energy optimization, but traditional model-level scaling is often too coarse to capture intra-inference variations, while fine-grained operator-level scaling suffers prohibitive performance degradation because of significant hardware switching latency. This paper presents SparseDVFS, a fine-grained, sparse-aware DVFS framework for energy-efficient edge inference. Our key insight is that operator sparsity is a primary metric for hardware frequency modulation. By distinguishing compute-bound dense operators from memory-bound sparse operators, the system can apply specialized frequency triplets to maximize energy efficiency. To overcome switching overheads and component interference, SparseDVFS incorporates three key innovations: (1) an offline modeler that establishes a deterministic mapping between operator sparsity and optimal frequency triplets (CPU/GPU/EMC) via white-box timeline analysis; (2) a runtime graph partitioner that uses a greedy merging heuristic to aggregate operators into super-blocks, balancing scaling granularity against DVFS switching latency through a latency amortization constraint; and (3) a unified co-governor that employs a frequency unified scaling engine (FUSE) and a look-ahead instruction queue to eliminate antagonistic effects between independent controllers and hide hardware transition latencies. Extensive evaluations show that SparseDVFS achieves an average 78.17% energy efficiency gain over state-of-the-art solutions while maintaining a superior 14% cost-gain ratio.
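
The greedy merging idea behind the runtime graph partitioner can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not state the exact latency amortization constraint, so the sketch assumes a block may be closed only once its accumulated latency is at least `amortize` times the DVFS switching latency. The `Op` type, the 0.5 sparsity threshold, the latency-weighted block sparsity, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Op:
    name: str
    latency_ms: float  # profiled execution time (hypothetical value)
    sparsity: float    # fraction of zero elements, in [0, 1]


def block_sparsity(block: List[Op]) -> float:
    """Latency-weighted mean sparsity of a block (an assumed aggregate)."""
    total = sum(op.latency_ms for op in block)
    return sum(op.latency_ms * op.sparsity for op in block) / total


def greedy_merge(ops: List[Op], switch_ms: float,
                 amortize: float = 10.0, threshold: float = 0.5) -> List[List[Op]]:
    """Greedily merge consecutive operators into super-blocks.

    Assumed constraint: a new block is started only when the incoming
    operator's class (sparse vs. dense) differs from the current block's
    class AND the current block already runs for at least
    amortize * switch_ms, so the frequency switch is amortized.
    The final block is emitted even if it is shorter than that bound.
    """
    blocks: List[List[Op]] = []
    current: List[Op] = []
    for op in ops:
        if current:
            block_ms = sum(o.latency_ms for o in current)
            class_changes = (op.sparsity >= threshold) != (
                block_sparsity(current) >= threshold)
            if class_changes and block_ms >= amortize * switch_ms:
                blocks.append(current)
                current = []
        current.append(op)
    if current:
        blocks.append(current)
    return blocks


# Hypothetical operator trace: dense convolutions with short sparse activations.
ops = [
    Op("conv1", 3.0, 0.10),
    Op("relu1", 0.1, 0.70),
    Op("conv2", 3.0, 0.15),
    Op("relu2", 0.1, 0.80),
    Op("layernorm", 3.0, 0.90),
    Op("gelu", 2.5, 0.85),
]
blocks = greedy_merge(ops, switch_ms=0.5)
block_names = [[op.name for op in b] for b in blocks]
print(block_names)
```

In this trace, `relu1` is absorbed into the dense block because the block is still too short to justify a switch, while `relu2` opens a new sparse block once the dense block has run long enough. A larger `amortize` yields coarser blocks and fewer switches.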

Paper Structure (35 sections, 3 equations, 19 figures, 3 tables, 1 algorithm)

Figures (19)

  • Figure 1: Comparison of DVFS governor granularities for edge DNN inference. (a) Model-level DVFS governor employs a single V/f setting. (b) Operator-level fine-grained DVFS governor. (c) Our sparsity-aware block-level DVFS governor.
  • Figure 2: Roofline model across diverse DNN architectures. For CNN-based models (i.e., ResNet-18/101), dense operators (e.g., Conv2d, Linear) reside on the horizontal plateau (i.e., Compute-Bound), while activation and normalization layers fall into the sloped region (i.e., Memory-Bound). For Transformer-based models (i.e., ViT-B16/L16), dense operators (e.g., Conv2d) reside on the horizontal plateau (i.e., Compute-Bound), while linear and normalization layers fall into the sloped region (i.e., Memory-Bound).
  • Figure 3: Dynamic sparsity distribution (CDF) for CNN and Transformer models on the ImageNet-2012 validation set (about 50k images). A substantial portion of operators in ResNet and ViT exhibit high sparsity (>50%).
  • Figure 4: Runtime traces of CPU and GPU frequencies under default DVFS governors for ResNet-18. Independent and often inverse fluctuations between frequencies create an antagonistic effect.
  • Figure 5: Comparison between DVFS switching overhead and end-to-end inference latency for ResNet-18.
  • ...and 14 more figures
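
The compute-bound versus memory-bound distinction in the Figure 2 caption follows the standard roofline model: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. The sketch below classifies operators by comparing their intensity against the ridge point. The hardware numbers and operator intensities are hypothetical and are not taken from the paper.

```python
def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def classify(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> str:
    """Compute-bound at or above the ridge point, memory-bound below it."""
    ridge = peak_gflops / bandwidth_gbs  # FLOP/byte where the two roofs meet
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical device: 1000 GFLOP/s peak, 50 GB/s memory bandwidth (ridge = 20 FLOP/byte).
PEAK, BW = 1000.0, 50.0

# Hypothetical arithmetic intensities in FLOP/byte.
conv_intensity = 60.0    # a dense convolution with high data reuse
relu_intensity = 0.125   # 1 FLOP per 8 bytes moved (4 read + 4 written)

conv_class = classify(conv_intensity, PEAK, BW)
relu_class = classify(relu_intensity, PEAK, BW)
relu_bound = attainable_gflops(relu_intensity, PEAK, BW)
print(conv_class, relu_class, relu_bound)
```

Under these assumed numbers, the activation layer can reach at most 6.25 GFLOP/s regardless of the compute clock, which is why a memory-bound operator gains little from a high GPU frequency.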