
Concurrent Training and Layer Pruning of Deep Neural Networks

Valentin Frank Ingmar Guenter, Athanasios Sideris

TL;DR

The paper tackles the high computational cost of training and deploying very deep neural networks by introducing concurrent training and layer pruning. It inserts layerwise Bernoulli multipliers into residual structures, learns their activation probabilities via variational inference with a flattening hyper-prior, and proves that optimal solutions are deterministic ($\theta^l \in \{0,1\}$). A projected SGD algorithm with efficient gradient estimators converges to a minimum, enabling early pruning and substantial FLOPs/parameter reductions while maintaining accuracy. Experiments on MNIST, CIFAR-10/100, and ImageNet with LeNet, VGG16, and ResNet demonstrate competitive pruning performance with significant training-time savings. The approach provides practical pruning criteria, rigorous convergence guarantees, and scalable applicability across architectures.

Abstract

We propose an algorithm capable of identifying and eliminating irrelevant layers of a neural network during the early stages of training. In contrast to weight or filter-level pruning, layer pruning reduces the harder-to-parallelize sequential computation of a neural network. We employ a structure using residual connections around nonlinear network sections that allow the flow of information through the network once a nonlinear section is pruned. Our approach is based on variational inference principles using Gaussian scale mixture priors on the neural network weights and allows for substantial cost savings during both training and inference. More specifically, we learn the variational posterior distribution of scalar Bernoulli random variables, each multiplying a layer weight matrix of a nonlinear section, similarly to adaptive layer-wise dropout. To overcome challenges of concurrent learning and pruning, such as premature pruning and lack of robustness with respect to weight initialization or the size of the starting network, we adopt the "flattening" hyper-prior on the prior parameters. We prove that, as a result of its usage, the solutions of the resulting optimization problem describe deterministic networks with parameters of the posterior distribution at either 0 or 1. We formulate a projected SGD algorithm and prove its convergence to such a solution using stochastic approximation results. In particular, we prove conditions that lead to a layer's weights converging to zero and derive practical pruning conditions from the theoretical results. The proposed algorithm is evaluated on the MNIST, CIFAR-10 and ImageNet datasets and common LeNet, VGG16 and ResNet architectures. The simulations demonstrate that our method achieves state-of-the-art performance for layer pruning at reduced computational cost compared with competing methods, owing to the concurrent training and pruning.

Paper Structure

This paper contains 29 sections, 4 theorems, 81 equations, 5 figures, 7 tables, 1 algorithm.

Key Result

Theorem 3.1

Consider the neural network of (eq:struct_res) and the optimization problem (eq:L_train_obj). Then the minimal values of (eq:L_train_obj) with respect to $\Theta$ are achieved at the extreme values $\theta^l \in \{0,1\}$, and the resulting optimal network is deterministic.
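
As a rough illustration of why a minimizer can sit at an endpoint of $[0,1]$, the sketch below runs projected stochastic gradient descent on a single gate probability $\theta$. It relies only on the fact that the expectation of any function of a Bernoulli($\theta$) variable is linear in $\theta$. Everything else is an assumption made for illustration: the two loss values `loss_on` and `loss_off`, the bounded uniform gradient noise, the step size, and the function names. It does not model the prior or hyper-prior terms of the paper's objective and is not the authors' algorithm.

```python
import random


def expected_loss(theta, loss_on, loss_off):
    # E[f(xi)] for xi ~ Bernoulli(theta) is linear in theta:
    # theta * f(1) + (1 - theta) * f(0).
    return theta * loss_on + (1.0 - theta) * loss_off


def project(theta):
    # Euclidean projection of a scalar onto the interval [0, 1].
    return min(1.0, max(0.0, theta))


def projected_sgd(loss_on, loss_off, theta0=0.5, lr=0.1, steps=200, seed=0):
    # Hypothetical toy setup: the exact gradient of expected_loss with
    # respect to theta is (loss_on - loss_off); bounded noise stands in
    # for the minibatch gradient estimate.
    rng = random.Random(seed)
    theta = theta0
    for _ in range(steps):
        grad = (loss_on - loss_off) + rng.uniform(-0.5, 0.5)
        theta = project(theta - lr * grad)  # gradient step, then projection
    return theta


if __name__ == "__main__":
    # Keeping the layer active costs more: theta is driven to 0 (prune).
    print(projected_sgd(loss_on=2.0, loss_off=1.0))
    # Keeping the layer active costs less: theta is driven to 1 (keep).
    print(projected_sgd(loss_on=1.0, loss_off=2.0))
```

In this toy setting the iterate ends at one of the two endpoints, which corresponds to a gate that is deterministically off or on.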

Figures (5)

  • Figure 1: A section of the general NN structure considered in this work. $z^l$ and $z^{l+1}$ are feature vectors in the NN. $\mathrm{Block}^l(W,a)$ represents a single layer or multiple consecutive layers with weights $W$ and activation functions $a$. $\xi^l$ are Bernoulli random variables with parameter $\pi^l$ and $h^l$ is an activation function.
  • Figure 2: The specific network structure considered in this work (Ours res. in Table \ref{tab:architectures}). $z^l$ are input features, $a^l$ the activation function and $W_1^l, W_2^l$ and $W_3^l$ network weights. The Bernoulli RV $\xi^l$ with parameter $\pi^l$ multiplies the output of the block $W_1^l-a^l-W_2^l$ and yields $\bar{z}^l$; $\bar{\delta}^l$ and $\delta^{l+1}$ are back-propagated gradients of the loss function with respect to $\bar{z}^l$ and $z^l$, respectively.
  • Figure 3: Histograms of the layer indices that survived the pruning/training process when using our algorithm with the convolutional NN architectures on the MNIST dataset.
  • Figure 4: Results for our version of VGG16 on CIFAR-10. Left: The total number of parameters in the network during training/pruning. Right: Cumulative training load of the network during training/pruning.
  • Figure 5: Results for our $W$-$B$-$a$-$W$ version of ResNet110 on CIFAR-10. Left: The total number of parameters in the network during training/pruning. Right: Cumulative training load of the network during training/pruning.
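
The gated structure described in the captions of Figures 1 and 2 can be sketched as a forward pass in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the captions do not say how $W_3^l$ and $\bar{z}^l$ are combined, so the code assumes $W_3^l$ acts on the skip path and the two paths are added; ReLU stands in for $a^l$; the outer activation $h^l$ is omitted; and the names `gated_block`, `matvec` and `relu` are hypothetical.

```python
import random


def matvec(W, v):
    # Dense matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def relu(v):
    # Stand-in for the activation a^l (assumption: ReLU).
    return [max(0.0, x) for x in v]


def gated_block(z, W1, W2, W3, pi, rng):
    # Assumed skip path: W3 applied to the block input.
    skip = matvec(W3, z)
    if pi == 0.0:
        # Gate probability at 0: the nonlinear branch is never active,
        # so it can be removed and only the skip path is computed.
        return skip
    xi = 1.0 if rng.random() < pi else 0.0  # xi ~ Bernoulli(pi)
    if xi == 0.0:
        return skip
    # z_bar = xi * W2 a(W1 z), the gated output of the W1-a-W2 block.
    z_bar = matvec(W2, relu(matvec(W1, z)))
    return [s + b for s, b in zip(skip, z_bar)]


if __name__ == "__main__":
    rng = random.Random(0)
    z = [1.0, -2.0]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    W2 = [[2.0, 0.0], [0.0, 2.0]]
    print(gated_block(z, identity, W2, identity, 1.0, rng))  # gate always on
    print(gated_block(z, identity, W2, identity, 0.0, rng))  # branch pruned
```

With $\pi^l = 0$ the input still reaches the next block through the skip path, which is the property the abstract relies on when a nonlinear section is pruned.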

Theorems & Definitions (8)

  • Theorem 3.1
  • proof
  • Theorem 5.1
  • proof
  • Theorem 5.2
  • proof
  • Theorem 5.3
  • proof