The Difficulty of Training Sparse Neural Networks

Utku Evci, Fabian Pedregosa, Aidan Gomez, Erich Elsen

TL;DR

The paper investigates why sparse neural networks are difficult to train and why pruning-based solutions outperform sparse networks trained from scratch or from lottery initializations. It uses energy-landscape analysis and interpolation-based path-finding, with both straight-line interpolation and Bézier curves, to study optimization in the sparse and dense subspaces for ResNet-50 on ImageNet. Within the sparse subspace it finds a monotonically decreasing linear path from initialization to the pruned solution, but a high-energy barrier between the scratch and pruned solutions. Allowing the Bézier curve to pass through the dense subspace yields decreasing paths between the two solutions, which suggests that extra dimensions are needed to escape sparse stationary points. These results suggest that future sparse-training methods should allow transitions to denser connectivity, or use optimizers and initializations that can bridge sparse configurations, potentially improving sparse network performance.

Abstract

We investigate the difficulties of training sparse neural networks and make new observations about optimization dynamics and the energy landscape within the sparse regime. Recent work (Gale et al., 2019; Liu et al., 2018) has shown that sparse ResNet-50 architectures trained on the ImageNet-2012 dataset converge to solutions that are significantly worse than those found by pruning. We show that, despite the failure of optimizers, there is a linear path with a monotonically decreasing objective from the initialization to the "good" solution. Additionally, our attempts to find a decreasing objective path from "bad" solutions to the "good" ones in the sparse subspace fail. However, if we allow the path to traverse the dense subspace, then we consistently find a path between the two solutions. These findings suggest that traversing extra dimensions may be needed to escape stationary points found in the sparse subspace.
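The linear-path observation in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: it evaluates a loss at evenly spaced points on the straight line between two parameter vectors and checks whether the values never increase. The function names, the toy quadratic loss, and the four-element weight vectors are all assumptions made for illustration; in the paper the loss is the ResNet-50 training objective. Because both endpoints share the same zero pattern, every interpolated point stays in the same sparse subspace.

```python
from typing import Callable, List, Sequence


def interpolate(theta_a: Sequence[float], theta_b: Sequence[float], alpha: float) -> List[float]:
    """Return (1 - alpha) * theta_a + alpha * theta_b, a point on the straight line."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_along_line(
    loss_fn: Callable[[Sequence[float]], float],
    theta_a: Sequence[float],
    theta_b: Sequence[float],
    num_steps: int = 50,
) -> List[float]:
    """Evaluate loss_fn at num_steps + 1 evenly spaced points from theta_a to theta_b."""
    return [loss_fn(interpolate(theta_a, theta_b, i / num_steps)) for i in range(num_steps + 1)]


def is_monotonically_decreasing(values: Sequence[float], tol: float = 0.0) -> bool:
    """True if no value exceeds its predecessor by more than tol."""
    return all(later <= earlier + tol for earlier, later in zip(values, values[1:]))


# Hypothetical toy data: zeros stand in for pruned connections.
good = [1.0, 0.0, -2.0, 0.0]  # stand-in for the pruned ("good") solution
init = [0.3, 0.0, 0.5, 0.0]   # stand-in for an initialization with the same sparsity pattern


def toy_loss(theta: Sequence[float]) -> float:
    """Assumed stand-in for the training loss: a quadratic bowl centred on `good`."""
    return sum((t - g) ** 2 for t, g in zip(theta, good))


# num_steps=50 gives increments of 0.02 in the interpolation coefficient.
losses = loss_along_line(toy_loss, init, good, num_steps=50)
print(is_monotonically_decreasing(losses))
```

On a convex toy loss the check passes trivially; the point of the sketch is only the procedure. For a real network, `toy_loss` would be replaced by an evaluation of the training loss on a fixed set of examples.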

Paper Structure

This paper contains 10 sections, 2 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Test accuracy of ResNet-50 networks trained on the ImageNet-2012 dataset at different sparsity levels. We observe a large gap in generalization accuracy between approaches based on pruning and other approaches. See text for details.
  • Figure 2: Experimental setup. In this paper we consider three different methods for obtaining sparse solutions: pruned, lottery and scratch. The pruned solution is obtained by starting with a densely connected network and gradually removing connections during training, whereas the other two solutions are obtained by training sparse networks from the start.
  • Figure 3: Linear interpolation experiments between various initial and final points. Interpolations are created with 0.02 increments and evaluated on 500k data-augmented images from the training set. Initial and final points (corresponding to coefficients 0 and 1 respectively) are labeled with abbreviations as presented in Figure 2. From all points considered, there exists a monotonically decreasing path to the solution found through pruning.
  • Figure 4: Interpolation experiments between pruned (P-S) and scratch (S-S) sparse solutions: (left) linear interpolation; (middle, right) Bézier curves minimized in the sparse and dense manifolds, respectively. Loss values are calculated using 500k images from the training set.
  • Figure 5: At the beginning of training we randomly set a fraction of the weights to zero and train the network with default parameters for 32k steps. We observe a sudden drop only if more than 99% of the parameters are set to zero.
  • ...and 3 more figures
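
The difference between the sparse and dense Bézier paths described in the Figure 4 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes a quadratic Bézier curve with a single control point (the source does not state the curve degree), assumes both endpoints share one binary mask, and uses hypothetical toy weight vectors. It shows only the geometric point: masking the control point keeps every point of the curve in the sparse subspace, while an unmasked control point lets intermediate points use pruned connections even though both endpoints remain sparse.

```python
from typing import List, Sequence


def bezier_point(start: Sequence[float], control: Sequence[float],
                 end: Sequence[float], t: float) -> List[float]:
    """Quadratic Bezier curve: (1-t)^2 * start + 2t(1-t) * control + t^2 * end."""
    return [(1 - t) ** 2 * s + 2 * t * (1 - t) * c + t ** 2 * e
            for s, c, e in zip(start, control, end)]


def apply_mask(theta: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Zero out every weight whose mask entry is 0."""
    return [w * m for w, m in zip(theta, mask)]


def stays_in_sparse_subspace(points: Sequence[Sequence[float]], mask: Sequence[int]) -> bool:
    """True if every masked-out weight is exactly zero at every point."""
    return all(w == 0.0 for p in points for w, m in zip(p, mask) if m == 0)


# Hypothetical toy data: both solutions share the same sparsity mask.
mask = [1, 0, 1, 0]
scratch = [0.5, 0.0, 1.5, 0.0]    # stand-in for the scratch solution
pruned = [1.0, 0.0, -2.0, 0.0]    # stand-in for the pruned solution
control = [0.2, 0.7, -0.4, 0.9]   # stand-in for a trainable control point

ts = [i / 10 for i in range(11)]
dense_curve = [bezier_point(scratch, control, pruned, t) for t in ts]
sparse_curve = [bezier_point(scratch, apply_mask(control, mask), pruned, t) for t in ts]

print(stays_in_sparse_subspace(sparse_curve, mask))  # True: masked control point
print(stays_in_sparse_subspace(dense_curve, mask))   # False: path leaves the subspace
```

In an actual experiment the control point would be optimized to lower the loss along the curve; here it is fixed so that the sketch stays self-contained.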