SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions
Ziyi Wang, Nan Jiang, Guang Lin, Qifan Song
TL;DR
The paper tackles the compression of large DNNs by proposing SQS, a unified Bayesian framework that jointly prunes and quantizes weights. It combines a spike-and-slab prior with a Gaussian Mixture Model to form a sparse, quantized sub-distribution within a variational posterior, and derives a tractable objective for scalable optimization. A theoretical result guarantees that, under mild conditions, the variational posterior converges toward the true underlying target function. Empirically, SQS achieves higher compression rates than baselines with competitive accuracy on ResNet, BERT-base, Llama3, and Qwen2.5; ablations validate the benefits of the spike-and-slab prior, Bayesian averaging, and outlier-aware windowing. Overall, SQS offers a practical and principled path to deploying highly compressed models on resource-constrained devices, with theoretical grounding for the joint pruning-quantization paradigm.
Abstract
Compressing large-scale neural networks is essential for deploying models on resource-constrained devices. Most existing methods apply weight pruning or low-bit quantization individually, which often yields suboptimal compression rates when the performance drop must stay acceptable. We introduce SQS, a unified framework for simultaneous pruning and low-bit quantization via Bayesian variational learning, which achieves higher compression rates than prior baselines while maintaining comparable performance. The key idea is to employ a spike-and-slab prior to induce sparsity and to model quantized weights using Gaussian Mixture Models (GMMs) to enable low-bit precision. In theory, we establish the consistency of our proposed variational approach for sparse and quantized deep neural networks. Extensive experiments on compressing ResNet, BERT-base, Llama3, and Qwen2.5 models show that our method achieves higher compression rates than a range of existing methods with comparable performance drops.
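
To make the joint pruning-and-quantization idea concrete, here is a minimal toy sketch in NumPy. It is not the SQS algorithm: magnitude pruning stands in for the learned spike-and-slab inclusion, a plain EM fit of a 1-D GMM stands in for the variational objective, and the function name `prune_and_quantize` with its parameters `sparsity`, `bits`, and `em_steps` is hypothetical, chosen only for illustration.

```python
import numpy as np

def prune_and_quantize(weights, sparsity=0.5, bits=2, em_steps=20):
    """Toy joint pruning + quantization of a 1-D weight vector (illustrative only)."""
    w = np.asarray(weights, dtype=float)

    # Stand-in for the "spike": zero out the smallest-magnitude weights.
    n_pruned = int(sparsity * w.size)
    mask = np.ones(w.size, dtype=bool)
    mask[np.argsort(np.abs(w))[:n_pruned]] = False
    kept = w[mask]

    # Stand-in for the quantized "slab": fit a K-component 1-D GMM by EM,
    # with K = 2**bits quantization levels.
    k = 2 ** bits
    means = np.quantile(kept, (np.arange(k) + 0.5) / k)  # spread initial means
    var = np.full(k, kept.var() / k + 1e-8)
    pi = np.full(k, 1.0 / k)
    for _ in range(em_steps):
        # E-step: responsibilities, computed in log space for stability.
        log_p = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                 - 0.5 * (kept[:, None] - means) ** 2 / var)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixture weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-12
        means = (resp * kept[:, None]).sum(axis=0) / nk
        var = (resp * (kept[:, None] - means) ** 2).sum(axis=0) / nk + 1e-8
        pi = nk / nk.sum()

    # Snap each kept weight to the nearest component mean.
    nearest = np.argmin(np.abs(kept[:, None] - means), axis=1)
    quantized = np.zeros_like(w)
    quantized[mask] = means[nearest]
    return quantized, means

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
q, levels = prune_and_quantize(w, sparsity=0.5, bits=2)
print("zeros:", int((q == 0).sum()), "distinct nonzero levels:", np.unique(q[q != 0]).size)
```

In this sketch, each surviving weight can be stored as a `bits`-wide index into `levels`, and pruned weights need no stored value, which is the intuition behind combining sparsity with low-bit precision.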
