Adaptive Depth Networks with Skippable Sub-Paths

Woochul Kang, Hyungseop Lee

TL;DR

This paper provides a formal rationale for why the proposed training method can reduce overall prediction errors while minimizing the impact of skipping sub-paths and demonstrates the generality and effectiveness of the approach with convolutional neural networks and transformers.

Abstract

Predictable adaptation of network depth can be an effective way to control inference latency and meet the resource constraints of various devices. However, previous adaptive depth networks provide neither general principles nor a formal explanation of why and which layers can be skipped; hence, their approaches are hard to generalize and require long and complex training steps. In this paper, we present a practical approach to adaptive depth networks that is applicable to various networks with minimal training effort. In our approach, every hierarchical residual stage is divided into two sub-paths, and they are trained to acquire different properties through a simple self-distillation strategy. While the first sub-path is essential for hierarchical feature learning, the second one is trained to refine the learned features and to minimize performance degradation if it is skipped. Unlike prior adaptive networks, our approach does not train every target sub-network in an iterative manner. At test time, however, we can connect these sub-paths in a combinatorial manner to select sub-networks with various accuracy-efficiency trade-offs from a single network. We provide a formal rationale for why the proposed training method can reduce overall prediction errors while minimizing the impact of skipping sub-paths. We demonstrate the generality and effectiveness of our approach with convolutional neural networks and transformers.
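The combinatorial selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `enumerate_subnetworks` and `forward`, the four-stage toy network, and the scalar "features" are all assumptions made for illustration, and the sketch omits training, self-distillation, and the skip-aware batch normalization mentioned in Figure 2. It only shows how one boolean per residual stage selects a sub-network, giving 2^S configurations for S stages.

```python
from itertools import product


def enumerate_subnetworks(num_stages):
    """Return every skip configuration for `num_stages` residual stages.

    Each configuration is a tuple of booleans; True at index s means the
    second (skippable) sub-path of stage s is skipped.
    """
    return list(product([False, True], repeat=num_stages))


def forward(x, stages, skip_config):
    """Run `x` through the stages, skipping second sub-paths as configured.

    `stages` is a list of (first_subpath, second_subpath) pairs of callables.
    Both are hypothetical stand-ins for groups of residual blocks.
    """
    h = x
    for (first, second), skip in zip(stages, skip_config):
        h = first(h)       # mandatory sub-path, always executed
        if not skip:
            h = second(h)  # refinement sub-path, optional at test time
    return h


# Toy stages (assumption): the first sub-path changes the feature strongly,
# the second sub-path only refines it slightly.
toy_stages = [(lambda h: h * 2.0, lambda h: h + 0.01 * h) for _ in range(4)]

configs = enumerate_subnetworks(len(toy_stages))
base = forward(1.0, toy_stages, (True,) * 4)   # every second sub-path skipped
full = forward(1.0, toy_stages, (False,) * 4)  # no sub-path skipped
```

In this toy setting the shallowest and deepest configurations produce nearby outputs because the second sub-paths change their input only slightly, which is the property the training method is meant to encourage in real networks.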

Paper Structure

This paper contains 21 sections, 4 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: (a) During training, every residual stage of a network is divided into two sub-paths. The layers in every second (orange) sub-path are optimized to minimize performance degradation even if they are skipped. (b) At test time, these second sub-paths can be skipped in a combinatorial manner, allowing instant selection of various parameter-sharing sub-networks. (c) The sub-networks selected from a single network form a better Pareto frontier than their individually trained counterparts.
  • Figure 2: Illustration of a residual stage with two sub-paths. While the first (blue) sub-path is mandatory for hierarchical feature learning, the second (orange) sub-path can be skipped for efficiency. The layers in the skippable sub-path are trained to preserve the feature distribution from $\mathbf{h}^s_{base}$ to $\mathbf{h}^s_{super}$ using the proposed self-distillation strategy. Having similar distributions, either $\mathbf{h}^s_{base}$ or $\mathbf{h}^s_{super}$ can be provided as input $\mathbf{h}^s$ to the next residual stage. In the mandatory sub-path, another set of batch normalization operators, called skip-aware BNs, is exploited if the second sub-path is skipped. These sub-paths are building blocks to construct sub-networks of varying depths.
  • Figure 3: $||F(\mathbf{h})||_2 / ||\mathbf{h}||_2$ at residual blocks. In our networks, skippable sub-paths (orange areas) minimally change the distribution of input $\mathbf{h}$.
  • Figure 4: (a) Results on the ImageNet validation dataset. Networks with the suffix '-Base' have the same depths as the base-nets of the corresponding adaptive depth networks. (b) Pareto frontiers formed by the sub-networks of our adaptive depth networks. ResNet50 (individual) and ResNet50 (KD individual) are non-adaptive networks having the same depths as the sub-networks of ResNet50-ADN.
  • Figure 5: (a) Training time (1 epoch), measured on an Nvidia RTX 4090 (batch size: 128). AlphaNet$^*$ is configured to have similar FLOPs to MbV2 and makes sub-networks by only adjusting the network depth. (b) Inference latency and energy consumption of adaptive networks, measured on an Nvidia Jetson Orin Nano (batch size: 1).
  • ...and 3 more figures
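
The quantity plotted in Figure 3, $||F(\mathbf{h})||_2 / ||\mathbf{h}||_2$, can be computed for a single residual block $\mathbf{h} + F(\mathbf{h})$ as in the sketch below. The helper names `l2_norm` and `residual_ratio`, the example vector, and the two toy residual functions are assumptions for illustration only and are not taken from the paper.

```python
import math


def l2_norm(v):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(x * x for x in v))


def residual_ratio(residual_fn, h):
    """Return ||F(h)||_2 / ||h||_2 for a block that computes h + F(h).

    `residual_fn` plays the role of F, i.e. the residual branch only,
    without the identity shortcut.
    """
    return l2_norm(residual_fn(h)) / l2_norm(h)


h = [3.0, 4.0]  # toy feature vector with ||h||_2 = 5


def large_update(v):
    # Hypothetical block in a mandatory sub-path: sizeable change to h.
    return [0.5 * x for x in v]


def small_update(v):
    # Hypothetical block in a skippable sub-path: h is barely changed.
    return [0.01 * x for x in v]


ratio_mandatory = residual_ratio(large_update, h)
ratio_skippable = residual_ratio(small_update, h)
```

A small ratio means the block changes its input only slightly relative to the input's magnitude, which is how the caption characterizes blocks in the skippable sub-paths.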