Deep Feature Surgery: Towards Accurate and Efficient Multi-Exit Networks

Cheng Gong, Yao Chen, Qiuyang Luo, Ye Lu, Tao Li, Yuzhi Zhang, Yufei Sun, Le Zhang

TL;DR

Multi-exit networks enable early predictions but suffer from gradient conflict among shared backbone weights, which limits accuracy. The authors propose Deep Feature Surgery (DFS), which combines feature partitioning and feature referencing to decouple shared parameters and reuse multi-scale features across exits, enabling end-to-end optimization with reduced backpropagation cost. DFS achieves up to a $50.00\%$ reduction in training time and up to a $6.94\%$ accuracy gain on CIFAR-100 and ImageNet across models, and reduces average FLOPs per image by about $2\times$ on budgeted tasks. The method is validated with extensive experiments on ResNet18 and MSDNet, and the code is publicly available at the project's GitHub repository.

Abstract

The multi-exit network is a promising architecture for efficient model inference: it shares backbone networks and weights among multiple exits. However, gradient conflict in the shared weights results in sub-optimal accuracy. This paper introduces Deep Feature Surgery (DFS), which consists of feature partitioning and feature referencing approaches to resolve gradient conflict during the training of multi-exit networks. Feature partitioning separates shared features along the depth axis among all exits to alleviate gradient conflict while simultaneously promoting joint optimization for each exit. Feature referencing then enhances multi-scale features for distinct exits across varying depths to improve model accuracy. Furthermore, DFS reduces the number of training operations by lowering the complexity of backpropagation. Experimental results on the CIFAR-100 and ImageNet datasets show that DFS provides up to a $50.00\%$ reduction in training time and up to a $6.94\%$ improvement in accuracy compared with baseline methods across diverse models and tasks. Budgeted batch classification evaluation on MSDNet demonstrates that DFS uses about $2\times$ fewer average FLOPs per image to achieve the same classification accuracy as baseline methods on CIFAR-100. The code is available at https://github.com/GongCheng1919/dfs.

Paper Structure (20 sections, 14 equations, 5 figures, 4 tables)

This paper contains 20 sections, 14 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Solutions to improve the accuracy of the exits for multi-exit networks. (a) Feature enhancement introduces additional features to improve the accuracy of exits. (b) Gradient selection adjusts gradients from conflicting sources to attain consistent update directions. (c) Our method alleviates gradient conflict and ensures end-to-end joint optimization of exits.
  • Figure 2: Deep feature surgery for multi-exit networks. The black arrows represent forward and backward propagation. The blue arrows indicate feature referencing. The red dotted line marks the partitioning position. DFS splits the features of each layer into two distinct parts, $f_i^+$ and $f_i^-$, with coefficient $\beta$. This reduces the number of shared weights $w_i^+$ among different exits, which mitigates gradient conflict and reduces backward computation. DFS cross-references the shared features and exit-specific features among exits of varying depths in the forward phase while ignoring these references in the backward phase, so each exit uses more features for prediction without introducing more inconsistent gradients.
  • Figure 3: Accuracy of budgeted batch classification. The X-axis is the average computational budget per image for the MSDNet model on CIFAR-100, and the Y-axis is top-1 accuracy.
  • Figure 4: Training efficiency comparison. The Y1-axis is the average latency (divided into forward, backward, and other parts) for one training step, the Y2-axis is the average FPS, and the X-axis shows the results of DSN, BYOT, and DFS under different batch sizes.
  • Figure 5: The impact of the coefficient $\beta$ on task accuracy and training efficiency. The left side shows the normalized accuracy of DFS with different $\beta$; it focuses on the ranking of the accuracy results on each task rather than their absolute values. The right side shows the training efficiency of VGG7-64 using DFS with different $\beta$. The Y1-axis is the average latency and the Y2-axis is the average FPS.
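
The partitioning and referencing described in the Figure 2 caption can be pictured with a small index-level sketch. This is an illustration only, not the authors' implementation: the names `Partition`, `partition_channels`, and `exit_feature_plan` are hypothetical, and it assumes that $\beta$ is the fraction of a layer's channels assigned to the shared part $f_i^+$ and that an exit's gradient reaches only its exit-specific part $f_i^-$. The source does not state either detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    shared: range         # channel indices of f+ (assumed: shared with deeper layers)
    exit_specific: range  # channel indices of f- (assumed: owned by this layer's exit)


def partition_channels(num_channels: int, beta: float) -> Partition:
    """Split a layer's channels into a shared part and an exit-specific part.

    Assumption: beta is the fraction of channels placed in the shared part.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    n_shared = int(num_channels * beta)
    return Partition(range(0, n_shared), range(n_shared, num_channels))


def exit_feature_plan(part: Partition) -> tuple:
    """Return (channels read in the forward pass, channels receiving this exit's gradient).

    Assumption: the exit references both parts when predicting, but the
    reference to the shared part is ignored in the backward pass.
    """
    forward = list(part.shared) + list(part.exit_specific)
    backward = list(part.exit_specific)
    return forward, backward


part = partition_channels(num_channels=64, beta=0.5)
forward, backward = exit_feature_plan(part)
print(len(forward), len(backward))
```

Under these assumptions the exit predicts from all 64 channels while its gradient touches only 32 of them, which is the sense in which referencing adds features without adding conflicting gradients. In an autograd framework the forward-only reference would typically be expressed with a stop-gradient operation on the referenced features.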