Pruning and Quantization for Deep Neural Network Acceleration: A Survey

Tailin Liang, John Glossner, Lei Wang, Shaobo Shi, Xiaotong Zhang

TL;DR

This survey addresses the challenge of deploying deep neural networks on resource-constrained devices by examining pruning and quantization as core compression techniques. It classifies pruning into static and dynamic methods, reviews a range of pruning criteria and strategies, and covers quantization approaches including post-training quantization (PTQ) and quantization-aware training (QAT), with emphasis on per-channel and low-bit methods. The paper also reviews deployment frameworks, hardware platforms, and compiler support, providing practical guidance and benchmarking insights. Collectively, pruning and quantization are shown to offer substantial speedups and storage reductions, with careful tuning and calibration enabling minimal or even positive impacts on accuracy in many cases. The findings highlight practical pathways for hardware-aware network compression and point to future work in automatic compression, broader architectural applicability, and integrated optimization across software and hardware stacks.
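To make the per-channel emphasis concrete, the following is a minimal NumPy sketch of symmetric 8-bit weight quantization with either one scale for the whole tensor or one scale per output channel. It is an illustration under stated assumptions, not the procedure of any framework reviewed in the survey: the function names, the symmetric [-127, 127] range, and the max-absolute-value calibration rule are all choices made here.

```python
import numpy as np

def quantize_symmetric(w, axis=None):
    """Symmetric int8 quantization (hypothetical helper).

    axis=None uses a single per-tensor scale; axis=0 uses one scale per
    output channel (assumed to be the first dimension of `w`).
    """
    if axis is None:
        max_abs = np.abs(w).max()
    else:
        # Reduce over every dimension except the channel axis.
        reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
        max_abs = np.abs(w).max(axis=reduce_axes, keepdims=True)
    # Guard against an all-zero channel producing a zero scale.
    scale = np.maximum(max_abs, 1e-12) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to floating point using the stored scale(s)."""
    return q.astype(np.float32) * scale

def mean_abs_error(w, axis=None):
    """Mean absolute round-trip error for the chosen granularity."""
    q, scale = quantize_symmetric(w, axis=axis)
    return float(np.mean(np.abs(w - dequantize(q, scale))))
```

When channels differ widely in magnitude, a single per-tensor scale is dominated by the largest channel and the small channels collapse toward zero; calling `mean_abs_error(w, axis=0)` on such a tensor should give a lower error than `mean_abs_error(w)`.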

Abstract

Deep neural networks have been applied in many applications, exhibiting extraordinary abilities in the field of computer vision. However, complex network architectures challenge efficient real-time deployment and require significant computation resources and energy costs. These challenges can be overcome through optimizations such as network compression. Network compression can often be realized with little loss of accuracy. In some cases accuracy may even improve. This paper provides a survey on two types of network compression: pruning and quantization. Pruning can be categorized as static if it is performed offline or dynamic if it is performed at run-time. We compare pruning techniques and describe criteria used to remove redundant computations. We discuss trade-offs in element-wise, channel-wise, shape-wise, filter-wise, layer-wise and even network-wise pruning. Quantization reduces computations by reducing the precision of the datatype. Weights, biases, and activations are typically quantized to 8-bit integers, although lower bit-width implementations, including binary neural networks, are also discussed. Pruning and quantization can be used independently or combined. We compare current techniques, analyze their strengths and weaknesses, present compressed network accuracy results on a number of frameworks, and provide practical guidance for compressing networks.
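Two of the pruning granularities named in the abstract, element-wise and filter-wise, can be sketched as static (offline) magnitude pruning in NumPy. This is a minimal illustration under stated assumptions: the function names are hypothetical, weight magnitude and filter L1 norm are used as the saliency criteria, and the weight layout is assumed to be (out_channels, in_channels, kernel_h, kernel_w). The survey compares many other criteria.

```python
import numpy as np

def prune_elementwise(weights, sparsity):
    """Zero the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(round(sparsity * weights.size))
    if k == 0:
        return weights.copy()
    # The k-th smallest absolute value serves as the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) > threshold, weights, 0.0)

def prune_filterwise(weights, num_filters_to_remove):
    """Zero whole output filters with the smallest L1 norm.

    Returns the pruned copy and the sorted indices of the zeroed filters.
    """
    l1 = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    drop = np.argsort(l1)[:num_filters_to_remove]
    pruned = weights.copy()
    pruned[drop] = 0.0
    return pruned, np.sort(drop)
```

Element-wise pruning yields an irregular sparsity pattern that needs sparse kernels or hardware support to turn into speedups, whereas zeroed filters can be physically removed to leave a smaller dense layer. In this sketch the filters are only masked, not removed.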

Paper Structure

This paper contains 59 sections, 40 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: CNN Acceleration Approaches: Following the flow from design to implementation, CNN acceleration falls into three categories: structure design (or generation), further optimization, and specialized hardware.
  • Figure 2: Separable Convolution: A standard convolution is decomposed into depth-wise convolution and point-wise convolution to reduce both the model size and computations.
  • Figure 3: Convolution Performance Optimization: From traditional convolution (dot squared) to image to column (im2col) - GEMM approach, adopted from Chellapilla2006. The red and green boxes indicate filter-wise and shape-wise elements, respectively.
  • Figure 4: Fully Connected Layer: Each node in a layer connects to all the nodes in the next layer, and every line corresponds to a weight value.
  • Figure 5: Inception Block: The inception block computes multiple convolutions with one input tensor in parallel, which extends the receptive field by mixing the size of kernels. The yellow-brown coloured cubes are convolutional kernels sized 1, 3, and 5. The blue cube corresponds to a $3\times3$ pooling operation.
  • ...and 12 more figures
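
The saving described in the Figure 2 caption can be checked with simple arithmetic. The sketch below counts weights and multiply-accumulate operations for a standard convolution and for its depth-wise plus point-wise decomposition. The function names are hypothetical, biases are ignored, and a depth multiplier of 1 is assumed, so this is an estimate rather than a figure taken from the paper.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

def separable_conv_params(kernel_size, in_channels, out_channels):
    """Weights in a depth-wise k x k convolution followed by a 1 x 1 point-wise one."""
    depthwise = kernel_size * kernel_size * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels               # 1 x 1 channel mixing
    return depthwise + pointwise

def conv_macs(params, out_height, out_width):
    """Multiply-accumulates: every weight is applied once per output position."""
    return params * out_height * out_width
```

For a 3x3 layer with 64 input and 128 output channels, the counts are 73,728 weights for the standard convolution and 8,768 for the separable one, a ratio of 1/128 + 1/9, or about 0.119. The same ratio applies to the multiply-accumulate count because both variants produce the same output size.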