Latency-aware Unified Dynamic Networks for Efficient Image Recognition

Yizeng Han, Zeyu Liu, Zhihang Yuan, Yifan Pu, Chaofei Wang, Shiji Song, Gao Huang

TL;DR

LAUDNet tackles the practical inefficiency of dynamic networks by unifying spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping under a latency-guided co-design. A latency predictor models block-level latency incorporating hardware properties, dynamic granularity, and activation rates to steer both algorithm design and scheduling on GPUs. The approach yields substantial real-world speedups (e.g., >50% latency reduction for ResNet-101) across server GPUs and edge devices while preserving accuracy, and demonstrates applicability to CNNs and vision transformers with robust empirical results on ImageNet and COCO. These findings highlight the value of latency-aware design for deploying adaptive inference in real-world vision systems and suggest promising extensions to broader architectures and tasks.

Abstract

Dynamic computation has emerged as a promising avenue to enhance the inference efficiency of deep networks. It allows selective activation of computational units, leading to a reduction in unnecessary computations for each input sample. However, the actual efficiency of these dynamic models can deviate from theoretical predictions. This mismatch arises from: 1) the lack of a unified approach due to fragmented research; 2) the focus on algorithm design over critical scheduling strategies, especially in CUDA-enabled GPU contexts; and 3) challenges in measuring practical latency, given that most libraries cater to static operations. Addressing these issues, we unveil the Latency-Aware Unified Dynamic Networks (LAUDNet), a framework that integrates three primary dynamic paradigms: spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping. To bridge the theoretical and practical efficiency gap, LAUDNet merges algorithmic design with scheduling optimization, guided by a latency predictor that accurately gauges dynamic operator latency. We've tested LAUDNet across multiple vision tasks, demonstrating its capacity to notably reduce the latency of models like ResNet-101 by over 50% on platforms such as V100, RTX3090, and TX2 GPUs. Notably, LAUDNet stands out in balancing accuracy and efficiency. Code is available at: https://www.github.com/LeapLabTHU/LAUDNet.

Paper Structure (26 sections, 6 equations, 14 figures, 5 tables)

Figures (14)

  • Figure 1: An overview of our method. (a) illustrates three representative adaptive inference algorithms (i.e. spatial-wise dynamic convolution, channel skipping, and layer skipping); (b) is an example of the scheduling strategy for spatial-wise dynamic convolution; and (c) presents our key idea of using the latency to guide both algorithm design and scheduling optimization.
  • Figure 2: Our proposed LAUDNet block. (a) We first use a lightweight module to generate the channel mask $\mathbf{M}^\mathrm{c}$ or the spatial/layer mask $\mathbf{M}^\mathrm{s}$/$\mathbf{M}^\mathrm{l}$. The granularity of dynamic inference is controlled by $G$ (for channel skipping) and $S$ (for spatially adaptive computation). During training, the channel mask is multiplied with the input and output of the $3\times 3$ convolution, and the spatial mask is applied to the final output of the block. Layer skipping can be implemented simply by setting $S$ equal to the feature resolution. The scheduling strategies used in inference ((b) for spatial-wise dynamic convolution and (c) for channel skipping) are designed to decrease memory access and facilitate parallel computation (Sec. \ref{sec_schedule_optim}). Note that we omit layer skipping here due to its simplicity: the whole block will be executed if the layer masker produces a value of 1.
  • Figure 3: The architecture design of the two types of maskers. The spatial/layer masker (a) is composed of an adaptive pooling layer and a $1\times 1$ convolution. The channel masker (b) consists of a global average pooling layer and a 2-layer MLP. The argmax operation is applied directly to obtain the discrete decisions during inference, while Gumbel Softmax \cite{jang2016categorical, maddison2016concrete} is utilized for end-to-end training (Sec. \ref{sec:train}).
  • Figure 4: Our hardware model, which allows us to model the latency of both data moving and computation.
  • Figure 5: Comparison between the real and predicted latency of a dynamic block in LAUD-ResNet-101.
  • ...and 9 more figures
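
The captions for Figures 2 and 3 describe how a mask is produced (argmax at inference, Gumbel Softmax in training) and how the granularity $S$ controls its spatial extent. The following is a minimal sketch of those two ideas in plain Python, not the authors' implementation. The function names, the (skip, compute) ordering of the two logits, and the toy logit values are all assumptions made for illustration; the learned maskers themselves (adaptive pooling plus a $1\times 1$ convolution, or global average pooling plus an MLP) are not reproduced here.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed (soft) sample over the logits, as used for end-to-end training.

    Adds Gumbel(0, 1) noise to each logit, divides by the temperature tau,
    and applies a softmax. `rng` is any object with a .random() method.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_decision(logits):
    """Inference-time decision: argmax over the assumed (skip, compute) logits."""
    return 1 if logits[1] > logits[0] else 0


def expand_spatial_mask(coarse_mask, S):
    """Repeat each coarse decision over an S x S patch of the feature map."""
    full = []
    for row in coarse_mask:
        expanded_row = [v for v in row for _ in range(S)]
        full.extend(list(expanded_row) for _ in range(S))
    return full


def activation_rate(mask):
    """Fraction of positions that are actually computed."""
    cells = [v for row in mask for v in row]
    return sum(cells) / len(cells)


# Toy 2 x 2 grid of (skip, compute) logits, i.e. a 4 x 4 feature map with S = 2.
toy_logits = [[[0.2, 1.5], [2.0, -1.0]],
              [[-0.3, 0.4], [0.9, 0.1]]]
coarse = [[hard_decision(pair) for pair in row] for row in toy_logits]
full = expand_spatial_mask(coarse, S=2)
print(coarse, activation_rate(full))
```

With a $1\times 1$ coarse mask and $S$ equal to the feature resolution, `expand_spatial_mask` yields a single decision for the whole feature map, which corresponds to the layer-skipping case described in the Figure 2 caption.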