Hierarchical Training of Deep Neural Networks Using Early Exiting

Yamin Sepehri; Pedram Pad; Ahmet Caner Yüzügüler; Pascal Frossard; L. Andrea Dunbar

TL;DR

The paper tackles the challenge of training high-accuracy DNNs on edge-cloud systems where bandwidth, latency, and privacy constraints are critical. It introduces a hierarchical training paradigm that partitions a network between edge and cloud and uses an edge-side early exit to generate an edge loss, enabling parallel backward updates without exchanging gradients. A runtime model $T^{\text{hierarchical}}_{\text{total}}$ and a separation-point algorithm are developed, and extensive experiments with VGG-16 and ResNet-18 on CIFAR-10 and Tiny ImageNet demonstrate substantial training-time reductions (up to $61\%$ on CIFAR-10 and $81\%$ on Tiny ImageNet) with negligible accuracy loss, especially under low-bandwidth conditions. The approach yields lower edge memory and compute, reduced cloud communication, robustness to network failures, and practical benefits for online learning on mobile and robotic devices in edge-cloud ecosystems, with guidance for selecting partition points and future hierarchical-friendly architectures.
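To make the idea of parallel backward updates without gradient exchange concrete, here is a minimal toy sketch. It is not the paper's implementation: the three scalar "networks" (`w_edge`, `w_exit`, `w_cloud`), the squared-error losses, and the helper names `train_step` and `train` are all illustrative assumptions, with gradients written by hand so that no autograd library is needed.

```python
# Toy illustration only: scalar stand-ins for the edge feature extractor,
# the early exit, and the cloud part of the network (all hypothetical).

def train_step(params, x, target, lr=0.05):
    """One hierarchical step. Returns (new_params, edge_loss, cloud_loss)."""
    w_edge, w_exit, w_cloud = params

    # Forward pass at the edge: feature extractor, then the early exit.
    feature = w_edge * x
    edge_pred = w_exit * feature
    edge_loss = (edge_pred - target) ** 2

    # The feature is sent to the cloud, which finishes the forward pass.
    cloud_pred = w_cloud * feature
    cloud_loss = (cloud_pred - target) ** 2

    # Edge backward pass: driven only by the early-exit (edge) loss.
    d_edge_pred = 2.0 * (edge_pred - target)
    grad_w_exit = d_edge_pred * feature
    grad_w_edge = d_edge_pred * w_exit * x

    # Cloud backward pass: the received feature is treated as a constant,
    # so no gradient is computed for w_edge and nothing is sent back.
    grad_w_cloud = 2.0 * (cloud_pred - target) * feature

    new_params = (
        w_edge - lr * grad_w_edge,
        w_exit - lr * grad_w_exit,
        w_cloud - lr * grad_w_cloud,
    )
    return new_params, edge_loss, cloud_loss


def train(params, data, steps=200, lr=0.05):
    """Cycle through the data; returns final params and the loss history."""
    history = []
    for step in range(steps):
        x, target = data[step % len(data)]
        params, edge_loss, cloud_loss = train_step(params, x, target, lr)
        history.append((edge_loss, cloud_loss))
    return params, history
```

In this sketch the edge update never reads `w_cloud` or the cloud loss, which is the property that would let the two backward passes run at the same time on separate workers.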

Abstract

Deep neural networks provide state-of-the-art accuracy for vision tasks, but they require significant resources for training. Thus, they are trained on cloud servers far from the edge devices that acquire the data. This increases communication cost, runtime, and privacy concerns. In this study, a novel hierarchical training method for deep neural networks is proposed that uses early exits in an architecture divided between edge and cloud workers to reduce the communication cost, training runtime, and privacy concerns. The method proposes a new use case for early exits: separating the backward pass of neural networks between the edge and the cloud during the training phase. Most available methods, due to the sequential nature of the training phase, either cannot train the levels of the hierarchy simultaneously or do so at the cost of compromising privacy. In contrast, our method can use both edge and cloud workers simultaneously, does not share the raw input data with the cloud, and does not require communication during the backward pass. Several simulations and on-device experiments for different neural network architectures demonstrate the effectiveness of this method. The proposed method reduces the training runtime for VGG-16 and ResNet-18 architectures by 29% and 61% in CIFAR-10 classification and by 25% and 81% in Tiny ImageNet classification when communication with the cloud takes place over a low bit rate channel. This runtime gain is achieved with a negligible drop in accuracy. The method is advantageous for online learning of high-accuracy deep neural networks on sensor-holding, low-resource devices such as mobile phones or robots as part of an edge-cloud system, making them more flexible when facing new tasks and classes of data.
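The dependence of the runtime gain on the channel bit rate can be sketched with a simple per-iteration timing model. This is a simplified reading of the schedule in the Figure 1 caption, not the paper's runtime formula or its separation-point algorithm: it assumes the edge backward pass overlaps with the whole cloud path (feature transfer, cloud forward, cloud backward), and the names `SplitProfile`, `hierarchical_step_time`, `full_cloud_step_time`, and `select_separation_point` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SplitProfile:
    """Hypothetical per-batch measurements for one candidate separation point."""
    name: str
    edge_forward_s: float
    edge_backward_s: float
    feature_bits: float      # size of the feature batch sent to the cloud
    cloud_forward_s: float
    cloud_backward_s: float
    edge_params: int         # parameters held at the edge for this split


def hierarchical_step_time(profile, bit_rate_bps):
    """Edge forward, then edge backward in parallel with the cloud path."""
    comm_s = profile.feature_bits / bit_rate_bps
    cloud_path_s = comm_s + profile.cloud_forward_s + profile.cloud_backward_s
    return profile.edge_forward_s + max(profile.edge_backward_s, cloud_path_s)


def full_cloud_step_time(input_bits, cloud_forward_s, cloud_backward_s, bit_rate_bps):
    """Baseline: send the raw input batch, then train entirely in the cloud."""
    return input_bits / bit_rate_bps + cloud_forward_s + cloud_backward_s


def select_separation_point(profiles, bit_rate_bps, max_edge_params):
    """Brute-force choice: fastest split that fits the edge memory budget."""
    feasible = [p for p in profiles if p.edge_params <= max_edge_params]
    if not feasible:
        raise ValueError("no separation point fits the edge memory budget")
    return min(feasible, key=lambda p: hierarchical_step_time(p, bit_rate_bps))
```

Under this model, a low bit rate makes the transfer term dominate, so splits that send a smaller feature batch than the raw input would be favored as long as the edge can hold the corresponding parameters.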

Paper Structure (16 sections, 5 equations, 10 figures, 3 tables, 1 algorithm)

Introduction
Related Works
Hierarchical Training
Early Exiting
Hierarchical Inference
Hierarchical Training with Early Exiting
Proposed Hierarchical Training Framework
Runtime Analysis
Separation Point Selection
Experiments
Hierarchical Models
Performance Analysis
On-Device Experiments
Ablation Study
Summary and Future Works
...and 1 more section

Figures (10)

Figure 1: a) A schematic view of the different parts in the proposed hierarchical execution framework. b) In the first step of training, the forward pass is done at the edge feature extractor to generate the feature set that is sent to the cloud. A local decision maker also takes the output of this feature extractor enabling an early exit that is later used for the backward pass at the edge. c) The feature set is communicated to the cloud and a more computationally intensive feature extraction together with a final decision-making are done there. Simultaneously, the backward pass of the edge is done to train the edge parameters. d) The backward pass of the cloud is done to update the cloud parameters. The green borders indicate running processes at each step.
Figure 2: The structure of the proposed hierarchical training method applied on the VGG-16 architecture (Simonyan and Zisserman, 2015). The position of separation can be moved along the different layers of the architecture.
Figure 3: The structure of the proposed hierarchical training method applied on the ResNet-18 architecture (He et al., 2015). The position of separation can be moved along the different residual blocks of the architecture.
Figure 4: The test accuracy of experiments on CIFAR-10 and Tiny ImageNet datasets for hierarchical training when separated at different points along the architectures compared to the full-cloud training accuracy. The left figures show the results for VGG-16 and the right figures show the results for ResNet-18. In addition to the main accuracy, the early exit test accuracy is presented as a side benefit of the proposed method. Although this accuracy is lower in comparison to the final exit, it shows how the edge handles the classification problem independently when there is a possible communication failure. The deepest separation points on the right side (the gray areas) are not practically desirable in the proposed hierarchical training method due to the high memory pressure at the edge and are shown for the sake of completeness (see Figure 5).
Figure 5: The number of parameters of the deep neural network at the edge side and the cloud side in the hierarchical system when it is separated at different points along the architecture in comparison to the number of parameters in the full-cloud system for CIFAR-10 and Tiny ImageNet experiments. The left figures show the results for VGG-16 and the right figures show the results for ResNet-18. Notice the low number of parameters at the edge for most of the separation points in the proposed hierarchical training method that is desirable due to the possible memory constraints. The deepest separation points on the right side (the gray areas) show where the number of parameters at the edge in the proposed hierarchical training method rises above the cloud. These points are not practically desirable due to the high memory pressure at the edge and are shown for the sake of completeness.
...and 5 more figures
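The edge/cloud parameter split described in the Figure 5 caption can be illustrated by counting convolution parameters on either side of a separation point. This sketch assumes the standard VGG-16 convolutional stack (13 layers, 3x3 kernels, with biases) and ignores batch normalization, the classifier layers, and the early-exit head, so it will not reproduce the exact counts of the models used in the paper. The names `conv_param_counts` and `split_param_counts` are hypothetical.

```python
# Output channels of the 13 convolutional layers in standard VGG-16.
VGG16_CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]


def conv_param_counts(channels, in_channels=3, kernel=3):
    """Weights plus biases for each conv layer in a plain sequential stack."""
    counts = []
    for out_channels in channels:
        counts.append(kernel * kernel * in_channels * out_channels + out_channels)
        in_channels = out_channels
    return counts


def split_param_counts(channels, split_after):
    """Return (edge_params, cloud_params) when separating after a given layer."""
    counts = conv_param_counts(channels)
    return sum(counts[:split_after]), sum(counts[split_after:])


def first_edge_heavy_split(channels):
    """Smallest split index at which the edge holds more parameters than the cloud."""
    for split_after in range(1, len(channels)):
        edge, cloud = split_param_counts(channels, split_after)
        if edge > cloud:
            return split_after
    return None
```

Because the late layers hold most of the parameters in this stack, the edge share stays small for shallow and middle separation points and only overtakes the cloud share near the end, which matches the qualitative trend the caption describes.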
