Elimination-compensation pruning for fully-connected neural networks

Enrico Ballini, Luca Muscarnera, Alessio Fumagalli, Anna Scotti, Francesco Regazzoni

TL;DR

This work introduces a novel pruning method in which the importance measure of each weight is computed considering the output behavior after an optimal perturbation of its adjacent bias, efficiently computable by automatic differentiation.

Abstract

The unmatched ability of Deep Neural Networks to capture complex patterns in large and noisy datasets is often associated with their large hypothesis space, and consequently with the vast number of parameters that characterize model architectures. Pruning techniques have established themselves as valid tools for extracting sparse representations of neural network parameters, carefully balancing compression against preservation of information. However, a fundamental assumption behind pruning is that expendable weights should have a small impact on the error of the network, while highly important weights should tend to have a larger influence on the inference. We argue that this idea could be generalized: what if a weight is not simply removed but also compensated by a perturbation of the adjacent bias, which does not contribute to the network sparsity? Our work introduces a novel pruning method in which the importance measure of each weight is computed considering the output behavior after an optimal perturbation of its adjacent bias, efficiently computable by automatic differentiation. These perturbations can then be applied directly after the removal of each weight, independently of each other. After deriving analytical expressions for the aforementioned quantities, numerical experiments are conducted to benchmark this technique against some of the most popular pruning strategies, demonstrating an intrinsic efficiency of the proposed approach in very diverse machine learning scenarios. Finally, our findings are discussed and the theoretical implications of our results are presented.
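
To make the elimination-compensation idea concrete, the following is a minimal sketch for a single linear layer $z = Wx + b$. It is not the paper's derivation: the paper scores weights through the network output and uses automatic differentiation, whereas this sketch assumes a simpler criterion, the mean squared change of the pre-activation $z_i$ over a set of samples. Under that assumption, removing $W_{ij}$ is best compensated by the bias shift $W_{ij}\,\mathrm{mean}(x_j)$, and the remaining error is $W_{ij}^2\,\mathrm{var}(x_j)$. The function names `compensated_scores`, `prune_one`, and `forward` are hypothetical.

```python
from statistics import fmean, pvariance


def compensated_scores(W, X):
    """Score each weight of one linear layer z = W x + b.

    Assumption (not from the paper): removing W[i][j] and shifting b[i]
    is judged by the mean squared change of z_i over the samples X.
    The minimizing shift is W[i][j] * mean(x_j); the error that remains
    after that shift is W[i][j]**2 * var(x_j).
    """
    n_in = len(W[0])
    columns = [[x[j] for x in X] for j in range(n_in)]
    mean = [fmean(c) for c in columns]
    var = [pvariance(c) for c in columns]
    scores = [[w * w * var[j] for j, w in enumerate(row)] for row in W]
    shifts = [[w * mean[j] for j, w in enumerate(row)] for row in W]
    return scores, shifts


def prune_one(W, b, X):
    """Remove the lowest-scoring nonzero weight and compensate its bias."""
    scores, shifts = compensated_scores(W, X)
    candidates = [
        (i, j)
        for i in range(len(W))
        for j in range(len(W[0]))
        if W[i][j] != 0.0
    ]
    i, j = min(candidates, key=lambda ij: scores[ij[0]][ij[1]])
    W_new = [row[:] for row in W]
    b_new = b[:]
    W_new[i][j] = 0.0          # elimination
    b_new[i] += shifts[i][j]   # compensation on the adjacent bias
    return W_new, b_new, (i, j)


def forward(W, b, x):
    """Pre-activations of the layer for one input vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


# Toy data: input 0 is constant across samples, input 1 varies.
X = [[2.0, -1.0], [2.0, 0.0], [2.0, 1.0]]
W = [[5.0, 0.1]]
b = [0.5]
W_pruned, b_pruned, removed = prune_one(W, b, X)
```

In this toy case a magnitude criterion would remove the small weight `0.1`, while the compensated score selects the large weight `5.0`: its input is constant over the samples, so its contribution can be absorbed into the bias and the layer output on `X` is left unchanged.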

Paper Structure (12 sections, 18 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Test losses on MNIST. In the left column, the losses are computed immediately after applying the pruning methods. In the right column, the losses are computed after fine-tuning, so that the overall procedure can be summarized as training–pruning–training.
  • Figure 2: Time snapshots of $u(x,t)$. The noisy data with $d \sim \mathcal{U}(-0.005, 0.005)$ are shown in black, while the green dashed line represents $u(x,t)$ computed by the pruned neural network with architecture 2 (see Tab. \ref{tab:architectures_pde}). In all panels, the network is pruned using the proposed method with a pruning ratio of $0.7$.
  • Figure 3: Test losses on Diffusion-Sorption PDE. In the left column, the losses are computed immediately after applying the pruning methods. In the right column, the losses are computed after fine-tuning, so that the overall procedure can be summarized as training–pruning–training.
  • Figure 4: Test losses on Diffusion-Sorption PDE. The bottom row corresponds to highly noisy data. In this case, it is difficult in practice to reach low loss values during training, so the first training phase is sufficient to reach the practical minimum, and the second training phase does not significantly improve performance. The loss of the baseline can therefore be comparable to that of the fully trained networks, as observed in the bottom-right panel.

Theorems & Definitions (1)

  • Remark 2.1: Pruning for inefficient training