Memory Bounded Deep Convolutional Networks

Maxwell D. Collins, Pushmeet Kohli

TL;DR

High-level problem: deep CNNs have prohibitive memory costs for deployment on constrained devices. The paper proposes sparsity-inducing regularizers that enforce sparse connectivity during training, including ℓ1 shrinkage and a direct ℓ0 projection, implemented within SGD and extended in Caffe with a layer-wise sparsity strategy. Key findings show substantial memory reductions (e.g., near 4x for AlexNet) with minimal accuracy loss on MNIST, CIFAR-10, and ImageNet, plus benefits from ensembles and robustness under reduced training data. The work provides a practical pathway to memory-efficient, high-performing CNNs suitable for resource-limited environments and scalable to large datasets.

Abstract

In this work, we investigate the use of sparsity-inducing regularizers during training of Convolutional Neural Networks (CNNs). These regularizers encourage fewer connections in the convolution and fully connected layers to take non-zero values, which in effect yields sparse connectivity between hidden units in the deep network. This in turn reduces the memory and runtime cost involved in deploying the learned CNNs. We show that training with such regularization can still be performed using stochastic gradient descent, implying that it can be used easily in existing codebases. Experimental evaluation of our approach on the MNIST, CIFAR, and ImageNet datasets shows that our regularizers can result in dramatic reductions in memory requirements. For instance, when applied to AlexNet, our method can reduce memory consumption by a factor of four with minimal loss in accuracy.
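To make the two regularizers named in the TL;DR concrete, here is a minimal NumPy sketch of an ℓ1 shrinkage (soft-thresholding) step and an ℓ0 projection (keep the k largest-magnitude weights), each applied after a plain SGD update. This summary does not give the paper's exact update rules, so the function names (`l1_shrink`, `l0_project`, `sgd_step`), the thresholds, and the toy weights are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def l1_shrink(w, threshold):
    """Soft-thresholding: move each weight toward zero by `threshold`,
    setting weights with magnitude below `threshold` exactly to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

def l0_project(w, k):
    """Keep the k largest-magnitude entries of w and zero out the rest."""
    if k <= 0:
        return np.zeros_like(w)
    if k >= w.size:
        return w.copy()
    flat = w.ravel()
    keep = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the top-k magnitudes
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(w.shape)

def sgd_step(w, grad, lr, sparsify):
    """One SGD step followed by a sparsity-inducing update (hypothetical helper)."""
    return sparsify(w - lr * grad)

# Toy weights and a zero gradient, so only the sparsity update changes w.
w = np.array([[0.5, -0.05, 0.2], [-0.8, 0.01, 0.0]])
grad = np.zeros_like(w)

w_l1 = sgd_step(w, grad, lr=0.1, sparsify=lambda v: l1_shrink(v, 0.1))
w_l0 = sgd_step(w, grad, lr=0.1, sparsify=lambda v: l0_project(v, 2))

print(w_l1)  # small weights become exactly zero, large ones shrink by 0.1
print(w_l0)  # only the two largest-magnitude weights survive
```

Both updates act elementwise or by a single partial sort, so in this sketch they slot in after the gradient step without changing the rest of the training loop, which is consistent with the abstract's point that training can still be performed with stochastic gradient descent.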

Paper Structure

This paper contains 22 sections, 8 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Exploring the trade-off between accuracy and model size achieved by our method for different problems. In the first row, each point plotted is a candidate network considered by the greedy search in Section \ref{sec:greedy-layerwise}. The $x$ axis shows the memory required to store the weights of that network, using the best choice of the storage formats described in Appendix \ref{sec:memory}. In the table we show the same for a network trained on ILSVRC 2012, as described in Section \ref{sec:implementation}.
  • Figure 2: Convolution filters learned on the CIFAR-10 classification task. The non-sparse kernels on the left come from a baseline model using classical weight decay as regularization. Incorporating a sparsity-inducing $\ell_1$ shrinkage operator during training yields the sparse filters on the middle right, and 20 all-zero filters not shown. The pixel-wise nonzero pattern of the sparse filters is shown on the far right.
  • Figure 3: Progress of the stochastic gradient optimization, with 10 repetitions plotted for each method. The optimization converged consistently under every sparsity update presented in this work.
  • Figure 4: These plots show the distribution of nonzero parameters determined by a greedy procedure seeking to maximize the accuracy of sparse networks for CIFAR-10. Each stack of boxes corresponds to a single network, and is centered on the accuracy for that network. The plot on the left directly counts the number of nonzeros in the layer. The plot on the right shows the same networks, but normalizes the height of the boxes such that each layer's box would be the same height for a dense network. "conv1-3" are the convolution layers, while "fc1-2" are the fully-connected layers.
  • Figure 5: Although each individual update under $\ell_0$-projection is a kind of thresholding operator, optimizing with the projection behaves very differently from simply thresholding a trained model. Allowing the model to optimize under an $\ell_0$ constraint improves the final test accuracy compared to simple thresholding.
  • ...and 2 more figures
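
The distinction drawn in the Figure 5 caption, between thresholding a trained model once and applying an ℓ0 projection after every update, can be sketched on a toy problem. The example below uses a small synthetic least-squares task rather than a CNN; the data, the step size, the sparsity budget `k`, and all function names are assumptions for illustration. It shows only the structural difference between the two procedures and makes no claim about which one wins on this toy problem.

```python
import numpy as np

def l0_project(w, k):
    """Keep the k largest-magnitude entries of a 1-D vector, zero the rest."""
    keep = np.argpartition(np.abs(w), -k)[-k:]
    out = np.zeros_like(w)
    out[keep] = w[keep]
    return out

def grad(w, X, y):
    """Gradient of the mean squared error 0.5 * mean((Xw - y)^2)."""
    return X.T @ (X @ w - y) / len(y)

def loss(w, X, y):
    return 0.5 * np.mean((X @ w - y) ** 2)

# Synthetic regression data with 5 truly nonzero coefficients (assumed setup).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:5] = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=200)

k, lr, steps = 5, 0.1, 500

# (a) Simple thresholding: train densely, then sparsify once at the end.
w_dense = np.zeros(20)
for _ in range(steps):
    w_dense -= lr * grad(w_dense, X, y)
w_thresh = l0_project(w_dense, k)

# (b) Optimization under the constraint: project after every gradient step,
#     so the surviving weights keep adapting to the sparsity pattern.
w_proj = np.zeros(20)
for _ in range(steps):
    w_proj = l0_project(w_proj - lr * grad(w_proj, X, y), k)

print("thresholded loss:", loss(w_thresh, X, y))
print("projected   loss:", loss(w_proj, X, y))
```

In (a) the retained weights were fitted while all other weights were still active; in (b) every step after the first starts from a sparse iterate, which is the behavior the caption contrasts with simple thresholding.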