DyCE: Dynamically Configurable Exiting for Deep Learning Compression and Real-time Scaling

Qingyuan Wang, Barry Cardiff, Antoine Frappé, Benoit Larras, Deepu John

TL;DR

DyCE addresses the problem that deployed deep learning models are static and cannot adapt to varying real-time demands. It attaches lightweight exit networks at intermediate points of the backbone, so the trade-off between accuracy and compute can be changed at runtime without redeploying the model. The approach comprises a decoupled exit design, training procedures with soft distillation, and search algorithms (iterative and single-pass) that generate sets of configurations meeting target performance constraints. On ImageNet, DyCE reduces MACs by 23.5% for ResNet152 and 25.9% for ConvNeXtv2-tiny with an accuracy loss below 0.5%. The experiments cover image classification; the authors present the decoupled design as applicable to other base models and, potentially, to tasks beyond image classification.

Abstract

Conventional deep learning (DL) model compression and scaling methods focus on altering the model's components, impacting the results across all samples uniformly. However, since samples vary in difficulty, a dynamic model that adapts computation based on sample complexity offers a novel perspective for compression and scaling. Despite this potential, existing dynamic models are typically monolithic and model-specific, limiting their generalizability as broad compression and scaling methods. Additionally, most deployed DL systems are fixed: they cannot adjust their scale once deployed and therefore cannot adapt to varying real-time demands. This paper introduces DyCE, a dynamically configurable system that can adjust the performance-complexity trade-off of a DL model at runtime without requiring re-initialization or redeployment on inference hardware. DyCE achieves this by adding small exit networks to intermediate layers of the original model, allowing computation to terminate early if acceptable results are obtained. DyCE also decouples the design of an efficient dynamic model, facilitating easy adaptation to new base models and potential general use in compression and scaling. We also propose methods for generating optimized configurations and determining the types and positions of exit networks to achieve desired performance and complexity trade-offs. By enabling simple configuration switching, DyCE provides fine-grained performance tuning in real time. We demonstrate the effectiveness of DyCE through image classification tasks using deep convolutional neural networks (CNNs). DyCE reduces computational complexity by 23.5% for ResNet152 and 25.9% for ConvNeXtv2-tiny on ImageNet, with accuracy reductions of less than 0.5%.
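
The configuration switching described in the abstract can be illustrated with a minimal Python sketch. It assumes that each pre-computed configuration stores one confidence threshold per exit together with a compute cost and an accuracy measured offline. The names `ExitConfig`, `CONFIGS`, and `select_config`, the fallback rule, and all numbers are hypothetical placeholders for illustration; none of them come from the paper.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ExitConfig:
    """One selectable operating point (hypothetical structure, not the paper's)."""
    name: str
    thresholds: Tuple[float, ...]  # one confidence threshold per exit position
    relative_macs: float           # expected compute, 1.0 = full original model
    accuracy: float                # accuracy measured offline on validation data


# Placeholder values for illustration only; they are NOT results from the paper.
# A threshold of 1.0 can never be exceeded by a probability, so "full" never exits early.
CONFIGS = (
    ExitConfig("full", (1.0, 1.0, 1.0), 1.00, 0.80),
    ExitConfig("balanced", (0.95, 0.90, 0.85), 0.75, 0.79),
    ExitConfig("fast", (0.80, 0.70, 0.60), 0.50, 0.75),
)


def select_config(configs: Sequence[ExitConfig], mac_budget: float) -> ExitConfig:
    """Return the most accurate configuration whose cost fits the budget.

    If nothing fits, fall back to the cheapest configuration (an assumption
    made for this sketch).
    """
    feasible = [c for c in configs if c.relative_macs <= mac_budget]
    if not feasible:
        return min(configs, key=lambda c: c.relative_macs)
    return max(feasible, key=lambda c: c.accuracy)


if __name__ == "__main__":
    # Switching only swaps the thresholds in use; the model weights stay loaded.
    for budget in (1.0, 0.8, 0.3):
        print(budget, select_config(CONFIGS, budget).name)
```

Because a switch in this sketch replaces only a small tuple of thresholds, it matches the abstract's claim that no re-initialization or redeployment is needed, although the actual configuration format used by DyCE may differ.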

Paper Structure

This paper contains 37 sections, 10 equations, 7 figures, 2 tables, and 3 algorithms.

Figures (7)

  • Figure 1: An overview of the DyCE system. (a) is the original deep learning model. (b) is the model when DyCE is applied. The backbone of the original model (shown in red) is divided into segments, and multiple early exits (shown in blue) are attached to the end of each segment. The logic of the exiting controllers is illustrated in (c). If the confidence level of an output from an exit point is greater than its associated threshold, inference completes immediately. Otherwise, the output of the current backbone segment is passed to the next segment. The exit selections ($k_n$) and corresponding thresholds ($t_n$) come from the selected configuration. These configurations are pre-defined to adapt to varying performance-complexity targets.
  • Figure 2: An example of DyCE with ResNet$_{50}$. All red dots are performance-complexity operating points that can be selected in real time.
  • Figure 3: Backbone segments and attached exits. At runtime, one of $K_n$ possible exits is applied to the output of the $n^\text{th}$ segment.
  • Figure 4: Accuracy of predictions at the $1^\text{st}$ exit (located at the output of the $1^\text{st}$ skip connection) of ResNet$_{50}$ for ImageNet. As the threshold increases, fewer samples exit at this position, while the accuracy of the predictions that do exit rises.
  • Figure 5: MACs and accuracy of the proposed method, base models, and related work. The original ResNet, ConvNeXtv2, and DaViT variants are represented as triangles; the compressed versions are denoted by the curve attached to each of them. The performance curves of related work are also plotted for reference. The related-work models cannot be configured at runtime or adapted to new base models, whereas DyCE can switch between configurations (denoted by red dots) at runtime and adapt to models with different architectures.
  • ...and 2 more figures
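
The exiting-controller logic in the Figure 1 caption can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: backbone segments and exits are stand-in functions on a single float, confidence is taken to be the maximum class probability, and a separate `final_head` (hypothetical) stands for the original model output when no exit fires. The names `dyce_forward` and `make_exit` are also hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Segment = Callable[[float], float]     # toy backbone segment: feature -> feature
Exit = Callable[[float], List[float]]  # toy exit: feature -> class probabilities


def make_exit(scale: float) -> Exit:
    """Build a toy two-class exit; a larger scale gives more confident outputs."""
    def exit_fn(h: float) -> List[float]:
        p = 1.0 / (1.0 + math.exp(-scale * h))
        return [1.0 - p, p]
    return exit_fn


def dyce_forward(
    x: float,
    segments: Sequence[Segment],
    exits: Sequence[Sequence[Exit]],  # exits[n] holds the K_n candidates for segment n
    exit_select: Sequence[int],       # k_n: which candidate exit is active at position n
    thresholds: Sequence[float],      # t_n: confidence threshold at position n
    final_head: Exit,
) -> Tuple[int, Optional[int]]:
    """Return (predicted class, exit position used), with None meaning no early exit."""
    h = x
    for n, segment in enumerate(segments):
        h = segment(h)
        probs = exits[n][exit_select[n]](h)
        confidence = max(probs)  # assumption: confidence = top class probability
        if confidence > thresholds[n]:
            return probs.index(confidence), n  # stop here, skip remaining segments
    probs = final_head(h)
    return probs.index(max(probs)), None


# Toy model: two segments, two candidate exits after the first, one after the second.
segments = [lambda h: h + 1.0, lambda h: h * 2.0]
exits = [[make_exit(1.0), make_exit(3.0)], [make_exit(1.0)]]
final_head = make_exit(2.0)

if __name__ == "__main__":
    for t in (0.7, 0.9, 0.99):
        print(t, dyce_forward(0.5, segments, exits, (0, 0), (t, t), final_head))
```

Raising the thresholds in this toy setup pushes the same input to later exits and finally to the full model, which mirrors how a stricter configuration trades extra computation for more reliable predictions.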