Keep the Gradients Flowing: Using Gradient Flow to Study Sparse Network Optimization

Kale-ab Tessera, Sara Hooker, Benjamin Rosman

TL;DR

The paper tackles the challenge of training sparse networks to reach the performance of dense models by moving beyond initialization to analyze how regularization, optimization, and architecture affect sparse learning. It introduces SC-SDC, a fair same-capacity framework for comparing sparse and dense networks, and Effective Gradient Flow (EGF), a gradient-flow measure that accounts for sparsity. Through extensive experiments on MLPs and CNNs across multiple datasets, the authors show that BatchNorm, activation choices (Swish/PReLU), and non-EWMA optimizers interact with gradient flow to significantly influence sparse network performance, with results extending to Wide ResNet-50 and magnitude pruning. The findings argue for a broader, optimization-tailored approach to sparsity, offering practical guidance for designing and training sparse architectures with improved efficiency and performance.

Abstract

Training sparse networks to converge to the same performance as dense neural architectures has proven to be elusive. Recent work suggests that initialization is the key. However, while this direction of research has had some success, focusing on initialization alone appears to be inadequate. In this paper, we take a broader view of training sparse networks and consider the role of regularization, optimization, and architecture choices on sparse models. We propose a simple experimental framework, Same Capacity Sparse vs Dense Comparison (SC-SDC), that allows for a fair comparison of sparse and dense networks. Furthermore, we propose a new measure of gradient flow, Effective Gradient Flow (EGF), that better correlates to performance in sparse networks. Using top-line metrics, SC-SDC and EGF, we show that default choices of optimizers, activation functions and regularizers used for dense networks can disadvantage sparse networks. Based upon these findings, we show that gradient flow in sparse networks can be improved by reconsidering aspects of the architecture design and the training regime. Our work suggests that initialization is only one piece of the puzzle and taking a wider view of tailoring optimization to sparse networks yields promising results.
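As a rough illustration of what a sparsity-aware gradient-flow measure could look like, the sketch below zeroes out gradients of pruned weights with a binary mask before taking a per-layer norm, then averages the norms over layers. The exact EGF definition is not reproduced in this summary, so the masking-then-averaging form, the function names (`p_norm`, `effective_gradient_flow`), and the toy numbers are all assumptions made for illustration.

```python
import math


def p_norm(values, p=2):
    """Return the p-norm (p = 1 or p = 2) of a flat list of numbers."""
    if p == 1:
        return sum(abs(v) for v in values)
    return math.sqrt(sum(v * v for v in values))


def effective_gradient_flow(grads, masks, p=2):
    """Average per-layer norm of gradients restricted to active weights.

    grads: list of flat per-layer gradient lists
    masks: list of flat per-layer 0/1 masks (1 = active weight)
    NOTE: hypothetical sketch, not the paper's reference implementation.
    """
    norms = []
    for g, m in zip(grads, masks):
        # Zero out gradients that belong to pruned (masked) weights.
        masked = [gi * mi for gi, mi in zip(g, m)]
        norms.append(p_norm(masked, p))
    return sum(norms) / len(norms)


# Toy two-layer example: the large gradient (12.0) sits on a pruned weight.
grads = [[3.0, 4.0, 12.0], [1.0, 2.0, 2.0]]
masks = [[1, 1, 0], [1, 1, 1]]

egf = effective_gradient_flow(grads, masks)
# Same gradients with no masking, i.e. a plain gradient-norm average.
dense_flow = effective_gradient_flow(grads, [[1, 1, 1], [1, 1, 1]])
```

In this toy case the unmasked measure is dominated by a gradient on a weight that never updates, which is the kind of distortion a sparsity-aware measure is meant to avoid.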

Paper Structure

This paper contains 26 sections, 9 equations, 24 figures, 6 tables.

Figures (24)

  • Figure 1: Same Capacity Sparse vs Dense Comparison (SC-SDC). SC-SDC is a simple framework that fairly compares sparse and dense networks. This is done by ensuring that the compared sparse and dense networks have the same number of active (nonzero) weights in each layer, and that these active weights are initially sampled from the same distribution.
  • Figure 2: Test Accuracy and Gradient Flow in Sparse and Dense MLPs. We study the effect of different regularization and optimization methods on test accuracy and average gradient flow, across different learning rates. We see that for Adam, a higher gradient flow tends to correlate to poor performance. The results for all optimizers can be found in Figures \ref{fig:c100_diff_reg_all_optims_low_lr_all} and \ref{fig:c100_diff_reg_all_optims_high_lr_with_batchnorm}.
  • Figure 3: Effect of Activation Functions on Accuracy and Gradient Flow on CIFAR-100, With a Low Learning Rate (0.001). We see that Swish is the most promising activation function across most optimizers. The results across all optimizers and learning rates are shown in Figures \ref{fig:c100_diff_reg_all_optims_low_lr_acts} and \ref{fig:c100_diff_reg_all_optims_high_lr_acts}.
  • Figure 4: Wide ResNet-50 Test Accuracy on CIFAR-100. We see that the results achieved on MLPs, using SC-SDC, are also consistent in CNNs. The densities range from 1% to 100% (fully dense) and the gradient flow results can be found in Figure \ref{fig:wres_grad_flow}.
  • Figure 5: Accuracy and Gradient Flow for Magnitude Pruning. We see that, as with randomly pruned networks, magnitude-pruned networks trained with Adam and $L2$ regularization show high EGF and poor performance.
  • ...and 19 more figures
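
The same-capacity idea in the Figure 1 caption can be sketched as follows: for each layer, a wider sparse layer gets a random binary mask with exactly as many active entries as the dense layer has weights, and both networks draw their active weights from one shared distribution. The layer shapes, the Gaussian initializer with standard deviation 0.1, and the helper names (`sc_sdc_masks`, `init_active_weights`) are hypothetical choices for illustration and are not taken from the paper.

```python
import random


def sc_sdc_masks(dense_shapes, sparse_shapes, seed=0):
    """Build one flat 0/1 mask per sparse layer so that its number of
    active weights equals the weight count of the matching dense layer.

    NOTE: hypothetical sketch; shapes are (fan_in, fan_out) tuples.
    """
    rng = random.Random(seed)
    masks = []
    for (d_in, d_out), (s_in, s_out) in zip(dense_shapes, sparse_shapes):
        n_active = d_in * d_out   # weights in the dense layer
        n_total = s_in * s_out    # slots available in the sparse layer
        if n_active > n_total:
            raise ValueError("sparse layer is too small to match capacity")
        # Choose which positions stay active, uniformly at random.
        active = set(rng.sample(range(n_total), n_active))
        masks.append([1 if i in active else 0 for i in range(n_total)])
    return masks


def init_active_weights(mask, rng, std=0.1):
    """Draw active weights from N(0, std^2); pruned slots stay at zero.
    The Gaussian initializer is an assumption, not the paper's choice."""
    return [rng.gauss(0.0, std) if m else 0.0 for m in mask]


dense_shapes = [(4, 3), (3, 2)]    # 12 and 6 weights
sparse_shapes = [(8, 6), (6, 4)]   # 48 and 24 slots, so 25% density

masks = sc_sdc_masks(dense_shapes, sparse_shapes)
rng = random.Random(1)
sparse_weights = [init_active_weights(m, rng) for m in masks]
# The dense network uses all-ones masks and the same initializer.
dense_weights = [init_active_weights([1] * (i * o), rng) for i, o in dense_shapes]

active_counts = [sum(m) for m in masks]
```

Matching per-layer counts, instead of only the network-wide total, keeps the comparison from favoring a network that concentrates its capacity in a single layer.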