Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression

Xiaoyi Qu, David Aponte, Colby Banbury, Daniel P. Robinson, Tianyu Ding, Kazuhito Koishida, Ilya Zharkov, Tianyi Chen

TL;DR

This work tackles the challenge of simultaneously pruning and quantizing deep neural networks by introducing GETA, an automated, one-shot framework that jointly optimizes structured sparsity and mixed-precision quantization. GETA has three central components: a Quantization-Aware Dependency Graph (QADG) for architecture-agnostic pruning, a Quantization-Aware Structured Sparse Optimizer (QASSO) that uses a partially projected stochastic gradient method to enforce bit-width constraints, and a joint learning strategy that coordinates pruning and quantization. Empirical results across CNNs, transformers, and vision-language models show competitive or superior performance and efficiency relative to existing joint pruning-quantization methods. The framework reduces engineering burden, provides white-box control over sparsity and bit-widths, and shows promise for practical deployment on varied hardware.

Abstract

Structured pruning and quantization are fundamental techniques used to reduce the size of deep neural networks (DNNs) and are typically applied independently. Applying these techniques jointly via co-optimization has the potential to produce smaller, high-quality models. However, existing joint schemes are not widely used because of (1) engineering difficulties (complicated multi-stage processes), (2) black-box optimization (extensive hyperparameter tuning to control the overall compression), and (3) insufficient architecture generalization. To address these limitations, we present the framework GETA, which automatically and efficiently performs joint structured pruning and quantization-aware training on any DNN. GETA introduces three key innovations: (i) a quantization-aware dependency graph (QADG) that constructs a pruning search space for generic quantization-aware DNNs, (ii) a partially projected stochastic gradient method that guarantees layerwise bit constraints are satisfied, and (iii) a new joint learning strategy that incorporates interpretable relationships between pruning and quantization. We present numerical experiments on both convolutional neural networks and transformer architectures showing that our approach achieves competitive (often superior) performance compared to existing joint pruning and quantization methods.
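To make the idea of a partially projected stochastic gradient step more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a symmetric uniform quantizer whose bit width is b = log2(q_m / d + 1) + 1 for clipping range q_m and step size d; that formula, along with the helper names `bit_width`, `project_step_size`, and `partially_projected_sgd_step`, is an assumption made for illustration. Only the step size is projected back into the range implied by the bit constraint [b_min, b_max], while the weights take a plain SGD step, which is what makes the projection "partial".

```python
import math


def bit_width(d, q_m):
    """Bit width implied by step size d and clipping range q_m.

    Assumed relation for a symmetric uniform quantizer (illustrative only).
    """
    return math.log2(q_m / d + 1.0) + 1.0


def project_step_size(d, q_m, b_min, b_max):
    """Clamp d so that the implied bit width lies in [b_min, b_max].

    Inverting the assumed relation gives d = q_m / (2**(b - 1) - 1),
    so a larger bit width corresponds to a smaller step size.
    Requires b_min >= 2 so that the denominators are positive.
    """
    d_lo = q_m / (2.0 ** (b_max - 1) - 1.0)  # step size at b_max
    d_hi = q_m / (2.0 ** (b_min - 1) - 1.0)  # step size at b_min
    return min(max(d, d_lo), d_hi)


def partially_projected_sgd_step(x, d, q_m, grad_x, grad_d, lr, b_min, b_max):
    """One step: plain SGD on weights x, projected SGD on step size d."""
    x_new = [xi - lr * gi for xi, gi in zip(x, grad_x)]  # no projection
    d_new = project_step_size(d - lr * grad_d, q_m, b_min, b_max)
    return x_new, d_new


# Hypothetical values: the raw update d = 0.4 would imply fewer than
# 4 bits, so the projection moves d back onto the 4-bit boundary.
x_new, d_new = partially_projected_sgd_step(
    x=[1.0, -2.0], d=0.5, q_m=1.0,
    grad_x=[0.5, 0.5], grad_d=1.0,
    lr=0.1, b_min=4, b_max=8,
)
```

Because the projection is a simple clamp on one scalar per layer, the bit constraint holds after every step regardless of the gradient values, which is the property the abstract attributes to the method.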

Paper Structure

This paper contains 19 sections, 3 theorems, 26 equations, 8 figures, 7 tables, and 4 algorithms.

Key Result

Proposition 5.1

Let $\hat{\nabla}_x f$ be the full gradient of the function $f(x,d,q_m,t)$ with respect to $x$. With the forget rate $\gamma$ chosen by the selection rule \ref{eq:forget.rate.rule} and the quantization step size $d$ chosen by the selection rule \ref{eq:quant.step.size.rule}, the search direction $s(x)$ is a descent direction for the function $f$ with

Figures (8)

  • Figure 1: GETA framework pipeline. Nodes Conv1 and Conv2 represent two convolutional layers, node BN represents batch normalization, and the "+" represents summation. For details on the remainder of the figures, see Sections \ref{sec:quantization}--\ref{sec:algorithm-description}.
  • Figure 2: Figures 2(a) and 2(b) illustrate the quantization-aware dependency graph analysis for weight quantization and activation quantization, respectively. Figure 2(c) presents a dependency graph after QADG analysis. Concrete examples are provided in Appendix \ref{appendix:dependency.graph}.
  • Figure 3: Phi2-2.7B.
  • Figure 4: Subfigure \ref{subfig:ablation.four.stage} presents an ablation study evaluating the necessity of the four distinct stages of the QASSO optimizer, using ResNet56 on the CIFAR10 benchmark and Phi2-2.7B on a common-sense task. The last two columns indicate the model's test accuracy. Subfigure \ref{subfig:ablation.compression.limit} illustrates the limits and boundaries of various compression techniques applied to ResNet56 on the CIFAR10 dataset.
  • Figure 5: Bert1 before performing quantization-aware dependency graph analysis.
  • ...and 3 more figures

Theorems & Definitions (6)

  • Proposition 5.1
  • proof
  • Proposition A.1
  • proof
  • Proposition B.1
  • proof