
Structured Pruning of Deep Convolutional Neural Networks

Sajid Anwar, Kyuyeon Hwang, Wonyong Sung

TL;DR

This work tackles the heavy computational and memory demands of CNNs by introducing structured sparsity at channel, kernel, and intra-kernel levels, guided by a particle-filter-based pruning strategy and followed by fixed-point optimization. It defines pruning granularities, employs a hybrid evolutionary particle filter to select pruning masks, and leverages convolution lowering to realize hardware-friendly speedups. The approach demonstrates substantial parameter and compute reductions with minimal accuracy loss on CIFAR-10 and MNIST, supported by 4–5 bit fixed-point representations for on-chip storage. The results indicate significant practical impact for embedded and hardware-accelerated deployments, enabling real-time CNN inference with reduced memory access and energy consumption.

Abstract

Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation effort but also map poorly onto parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, namely channel-wise, kernel-wise, and intra-kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments, and hardware-based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra-kernel strided sparsity with a simple constraint can significantly reduce the size of the kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word-length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory-based implementations of deep neural networks.
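The three pruning granularities named in the abstract can be illustrated with a small mask-building sketch. This is not the authors' code: the function names (`strided_kernel_mask`, `layer_mask`, `count_kept`), the nested-list weight layout, and the example layer sizes are assumptions made here for illustration. The sketch also takes the offset to be the index of the first pruned weight in a flattened kernel, and gives all kernels leaving the same source feature map the same stride and offset. The particle-filter selection of which masks to use is not shown.

```python
def strided_kernel_mask(k, stride, offset):
    """Flat 0/1 mask for a k x k kernel.

    Positions offset, offset + stride, ... are pruned (0); the rest are kept (1).
    """
    mask = [1] * (k * k)
    for idx in range(offset, k * k, stride):
        mask[idx] = 0
    return mask


def layer_mask(n_out, n_in, k, pruned_channels=(), pruned_kernels=(), strided=None):
    """Build mask[o][i] (a flat k*k list) for one convolution layer.

    pruned_channels: source feature maps whose outgoing kernels are all removed.
    pruned_kernels:  individual (out_map, in_map) kernels to remove.
    strided:         dict in_map -> (stride, offset), shared by every kernel
                     leaving that source feature map.
    """
    strided = strided or {}
    mask = []
    for o in range(n_out):
        row = []
        for i in range(n_in):
            if i in pruned_channels or (o, i) in pruned_kernels:
                row.append([0] * (k * k))          # channel- or kernel-level pruning
            elif i in strided:
                row.append(strided_kernel_mask(k, *strided[i]))  # intra-kernel
            else:
                row.append([1] * (k * k))          # dense kernel
        mask.append(row)
    return mask


def count_kept(mask):
    """Number of surviving weights in a layer mask."""
    return sum(sum(kernel) for row in mask for kernel in row)


# Hypothetical layer: 3 source maps, 2 destination maps, 3 x 3 kernels (54 weights).
mask = layer_mask(
    n_out=2, n_in=3, k=3,
    pruned_channels={0},          # drop source map 0 entirely
    pruned_kernels={(1, 1)},      # drop one kernel
    strided={2: (2, 1)},          # stride 2, offset 1 on kernels leaving map 2
)
print(count_kept(mask), "of", 2 * 3 * 3 * 3, "weights kept")
```

Keying the strided setting by source feature map, rather than by individual kernel, is what lets the sparsity be exploited when the convolution is later written as a matrix product.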

Paper Structure

This paper contains 10 sections, 1 equation, 8 figures, 1 table.

Figures (8)

  • Figure 1: Convolutional neural network with eight layers. Layers C1, C2 and C3 are the convolution layers while S2 and S4 constitute the two pooling layers. F6 and F7 are the two fully connected layers. This network can be represented with a string of 1-2-2-4-4-5-4-2 where each number denotes the count of feature maps in that layer [16].
  • Figure 2: (a) shows channel and filter wise pruning. The red dashed line shows channel level pruning. When we prune all the incoming filters to a feature map, all the outgoing kernels are also pruned. The blue dotted line depicts pruning $k \times k$ kernels. (b) shows intra kernel level sparsity for both structured and unstructured cases. Kernel level pruning (blue dotted) is a special case of intra-kernel pruning, when the sparsity rate is $100\%$.
  • Figure 3: The top figure provides an example of the convolution lowering idea introduced in [5][6]. (b) Our proposed idea constrains each outgoing convolution connection for a source feature map to have similar stride and offset. The offset shows the index of the first pruned weight. The constraint is shown with the similarly colored background squares. This significantly reduces the size of both the feature matrix and the kernel matrix. The first 9 columns in row 1 of the input feature matrix change from $2 \underline{2} \underline{1} 011 \underline{0}$ 2 to 21012 with the underlined elements pruned. Only the red colored elements in the feature maps and kernels survive and the rest are pruned. For this example, the size of the feature matrix is reduced from $9 \times 27$ to $9 \times 15$ and the kernel matrix size is reduced from $27 \times 2$ to $15 \times 2$. (Better seen in color)
  • Figure 4: (a) shows the dotted matrices as the pruning masks for weights between two layers. The state vector $x$ represents this mask. In (b), one example of the state vector is provided. The circles represent the neurons while $w_{ij}$ shows the weight going from neuron $i$ to $j$.
  • Figure 5: This plot shows that directly training a small sized network cannot reach the performance level of a similar sized network obtained through pruning.
  • ...and 3 more figures
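
The matrix sizes quoted in the Figure 3 caption can be reproduced with a small convolution-lowering sketch. It is an illustration only, not the authors' implementation: the helper names (`lower_features`, `lower_kernels`, `matmul`), the synthetic data, and the choice of stride 2 with offset 1 (which prunes flat kernel indices 1, 3, 5, 7 and keeps five of nine weights) are assumptions chosen so that three 5 × 5 source maps, 3 × 3 kernels, and two output maps give the 9 × 27 and 27 × 2 dense matrices and the 9 × 15 and 15 × 2 pruned ones. The values do not match the numbers drawn in the figure.

```python
def lower_features(fmaps, k, keep):
    """One row per output position; columns are the kept kernel taps per channel."""
    height, width = len(fmaps[0]), len(fmaps[0][0])
    rows = []
    for y in range(height - k + 1):
        for x in range(width - k + 1):
            row = []
            for c, fmap in enumerate(fmaps):
                for idx in keep[c]:
                    dy, dx = divmod(idx, k)   # flat kernel index -> (row, col)
                    row.append(fmap[y + dy][x + dx])
            rows.append(row)
    return rows


def lower_kernels(kernels, keep):
    """kernels[o][c] is a flat list of k*k weights; returns (kept taps) x n_out."""
    cols = [[kernels[o][c][idx] for c in range(len(kernels[o])) for idx in keep[c]]
            for o in range(len(kernels))]
    return [list(r) for r in zip(*cols)]      # transpose: one column per output map


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


C, H, W, K, N_OUT = 3, 5, 5, 3, 2
# Synthetic integer data (assumed for illustration only).
fmaps = [[[(c + 2 * y + x) % 3 for x in range(W)] for y in range(H)] for c in range(C)]
kernels = [[[((o + c + i) % 2) * 2 - 1 for i in range(K * K)] for c in range(C)]
           for o in range(N_OUT)]

pruned = set(range(1, K * K, 2))              # offset 1, stride 2
kept = [i for i in range(K * K) if i not in pruned]
dense_keep = [list(range(K * K))] * C
sparse_keep = [kept] * C                      # same pattern for every source map

f_dense, w_dense = lower_features(fmaps, K, dense_keep), lower_kernels(kernels, dense_keep)
f_sparse, w_sparse = lower_features(fmaps, K, sparse_keep), lower_kernels(kernels, sparse_keep)

# Reference: dense lowering with the pruned weights set to zero.
masked = [[[0 if i in pruned else w for i, w in enumerate(kern)] for kern in per_out]
          for per_out in kernels]
reference = matmul(f_dense, lower_kernels(masked, dense_keep))

print(len(f_dense), "x", len(f_dense[0]), "->", len(f_sparse), "x", len(f_sparse[0]))
print(len(w_dense), "x", len(w_dense[0]), "->", len(w_sparse), "x", len(w_sparse[0]))
print("same output:", matmul(f_sparse, w_sparse) == reference)
```

Because every kernel leaving a given source map drops the same taps, whole columns of the feature matrix and the matching rows of the kernel matrix can be removed together, so the smaller product gives the same result as the dense product with zeroed weights.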