LayerDropBack: A Universally Applicable Approach for Accelerating Training of Deep Networks

Evgeny Hershkovitch Neiterman, Gil Ben-Artzi

TL;DR

LayerDropBack (LDB) tackles the high cost of training deep networks by introducing randomness only in the backward pass, leaving the forward pass and the deployed model unchanged. It uses an epoch-based alternation between full backpropagation and layer dropping, with a random subset of layers updated per epoch and adjusted learning rate and batch size during dropped epochs, yielding architecture-agnostic speedups. Across ViT, Swin Transformer, EfficientNet, and DLA on CIFAR-100 and ImageNet, LDB achieves mean training-time reductions up to approximately 24% while maintaining or improving accuracy in many cases. The approach is simple to integrate, empirically effective across diverse models and datasets, and holds promise for broader adoption beyond computer vision.
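The epoch-based alternation described above can be illustrated with a small scheduling sketch. This is not the authors' implementation: the function name `ldb_schedule`, the strict even/odd alternation, the `drop_rate` parameter, and the seeding are all assumptions made for illustration. The learning-rate and batch-size adjustments applied during dropped epochs are not modeled here.

```python
import random


def ldb_schedule(num_layers, num_epochs, drop_rate, seed=0):
    """Return, for each epoch, the sorted layer indices that receive updates.

    Hypothetical sketch: even epochs use full backpropagation, odd epochs
    update only a random subset of layers. The forward pass is unaffected
    in both cases, because only the set of updated layers changes.
    """
    rng = random.Random(seed)
    # Number of layers still updated during a "dropped" epoch (at least one).
    keep = max(1, round(num_layers * (1.0 - drop_rate)))
    schedule = []
    for epoch in range(num_epochs):
        if epoch % 2 == 0:
            # Full-backpropagation epoch: every layer is updated.
            active = list(range(num_layers))
        else:
            # Layer-dropping epoch: one random subset, fixed for the epoch.
            active = sorted(rng.sample(range(num_layers), keep))
        schedule.append(active)
    return schedule


if __name__ == "__main__":
    for epoch, active in enumerate(ldb_schedule(num_layers=8, num_epochs=4, drop_rate=0.5)):
        print(f"epoch {epoch}: update layers {active}")
```

In a framework such as PyTorch, one plausible way to apply such a schedule would be to disable gradients (for example via `requires_grad_(False)`) on the layers outside the active set at the start of each dropped epoch and to re-enable them afterwards; the released code may do this differently.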

Abstract

Training very deep convolutional networks is challenging, requiring significant computational resources and time. Existing acceleration methods often depend on specific architectures or require network modifications. We introduce LayerDropBack (LDB), a simple yet effective method to accelerate training across a wide range of deep networks. LDB introduces randomness only in the backward pass, maintaining the integrity of the forward pass, guaranteeing that the same network is used during both training and inference. LDB can be seamlessly integrated into the training process of any model without altering its architecture, making it suitable for various network topologies. Our extensive experiments across multiple architectures (ViT, Swin Transformer, EfficientNet, DLA) and datasets (CIFAR-100, ImageNet) show significant training time reductions of 16.93% to 23.97%, while preserving or even enhancing model accuracy. Code is available at https://github.com/neiterman21/LDB.

Paper Structure

This paper contains 27 sections, 4 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The backpropagation stage dominates training time, consuming 87.2% for ResNet50 and 82.3% for DenseNet121.
  • Figure 2: Our approach is semi-stochastic in the parameter space and stochastic in the sample space. $w_1 \ldots w_m$ represent the parameter space and $\mathcal{B}^t$ the mini-batches. A red rectangle enclosing only a subset of the items denotes stochastic sampling, while one enclosing all items denotes deterministic sampling. Our approach alternates between (c) and (d).
  • Figure 3: Impact of drop rate on top-1 accuracy and speedup for DLA on CIFAR-10.
  • Figure 4: Training loss curves for DenseNet121 on the CIFAR-10 dataset.