Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression
Xiaoyi Qu, David Aponte, Colby Banbury, Daniel P. Robinson, Tianyu Ding, Kazuhito Koishida, Ilya Zharkov, Tianyi Chen
TL;DR
This work addresses the problem of pruning and quantizing deep neural networks simultaneously. It introduces GETA, an automated, one-shot framework that jointly optimizes structured sparsity and mixed-precision quantization. GETA has three central components: a Quantization-Aware Dependency Graph (QADG) that enables architecture-agnostic pruning, a Quantization-Aware Structured Sparse Optimizer (QASSO) that uses a partially projected stochastic gradient method to enforce layerwise bit-width constraints, and a joint learning strategy that coordinates pruning and quantization. Empirical results across CNNs, transformers, and vision-language models show competitive or superior performance and efficiency relative to existing joint pruning-quantization methods, with broad architectural generalization. The framework reduces engineering burden, provides white-box control over sparsity and bit-widths, and shows promise for practical deployment on varied hardware.
Abstract
Structured pruning and quantization are fundamental techniques for reducing the size of deep neural networks (DNNs) and are typically applied independently. Applying these techniques jointly via co-optimization has the potential to produce smaller, high-quality models. However, existing joint schemes are not widely used because of (1) engineering difficulties (complicated multi-stage processes), (2) black-box optimization (extensive hyperparameter tuning to control the overall compression), and (3) insufficient architecture generalization. To address these limitations, we present the framework GETA, which automatically and efficiently performs joint structured pruning and quantization-aware training on any DNN. GETA introduces three key innovations: (i) a quantization-aware dependency graph (QADG) that constructs a pruning search space for generic quantization-aware DNNs, (ii) a partially projected stochastic gradient method that guarantees layerwise bit constraints are satisfied, and (iii) a new joint learning strategy that incorporates interpretable relationships between pruning and quantization. We present numerical experiments on both convolutional neural networks and transformer architectures showing that our approach achieves competitive (often superior) performance compared to existing joint pruning and quantization methods.
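To make the idea of a partially projected stochastic gradient step concrete, the following minimal sketch may help. It is not the QASSO implementation. It assumes, purely for illustration, that each layer has a directly learnable bit-width and that the layerwise bit constraint is a box [b_min, b_max]; the paper's actual quantization parameterization may differ. The function name, arguments, and numbers are all hypothetical. "Partial" here means that only the bit-width variables are projected, while the weights take an ordinary SGD step.

```python
import numpy as np

def partially_projected_sgd_step(weights, bits, grad_w, grad_b, lr, b_min, b_max):
    """One SGD step in which only the bit-width variables are projected.

    Assumption (not from the paper): bit-widths are learnable real values,
    one per layer, constrained to the interval [b_min, b_max].
    """
    new_weights = weights - lr * grad_w          # plain SGD step, no projection
    new_bits = bits - lr * grad_b                # SGD step on the bit-widths
    new_bits = np.clip(new_bits, b_min, b_max)   # Euclidean projection onto the box
    return new_weights, new_bits

# Hypothetical toy values: three weights and three per-layer bit-widths.
weights = np.array([0.5, -1.2, 0.3])
bits = np.array([8.0, 4.5, 2.2])
grad_w = np.array([0.1, -0.2, 0.0])
grad_b = np.array([-30.0, 1.0, 20.0])

new_w, new_b = partially_projected_sgd_step(
    weights, bits, grad_w, grad_b, lr=0.1, b_min=2.0, b_max=8.0
)
print(new_w)  # weights after the unconstrained step
print(new_b)  # bit-widths, guaranteed to lie in [2, 8]
```

Because the projection runs after every update, the bit-widths satisfy the constraint at every iteration rather than only at convergence, which is the kind of guarantee item (ii) describes.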
