SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions

Ziyi Wang, Nan Jiang, Guang Lin, Qifan Song

TL;DR

The paper tackles the challenge of compressing large DNNs by proposing SQS, a unified Bayesian framework that jointly prunes and quantizes weights. It combines a spike-and-slab prior with a Gaussian Mixture Model to form a sparse, quantized sub-distribution within a variational posterior and derives a tractable objective for scalable optimization. A theoretical result guarantees that, under mild conditions, the variational posterior converges toward the true underlying target function. Empirically, SQS achieves higher compression rates than baselines with competitive accuracy across ResNet, BERT-base, Llama3, and Qwen2.5, and ablations validate the benefits of the spike-and-slab prior, Bayesian averaging, and outlier-aware windowing. Overall, SQS offers a practical and principled path to deploying highly compressed models on resource-constrained devices, with theoretical grounding for the joint pruning-quantization paradigm.

Abstract

Compressing large-scale neural networks is essential for deploying models on resource-constrained devices. Most existing methods apply weight pruning or low-bit quantization individually, which often yields suboptimal compression rates when the performance drop must stay acceptable. We introduce a unified framework for simultaneous pruning and low-bit quantization via Bayesian variational learning (SQS), which achieves higher compression rates than prior baselines while maintaining comparable performance. The key idea is to employ a spike-and-slab prior to induce sparsity and to model quantized weights with Gaussian Mixture Models (GMMs) to enable low-bit precision. In theory, we provide a consistency result for our proposed variational approach to a sparse and quantized deep neural network. Extensive experiments on compressing ResNet, BERT-base, Llama3, and Qwen2.5 models show that our method achieves higher compression rates than a line of existing methods with comparable performance drops.
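As a rough illustration of the spike-and-GMM idea described in the abstract, the sketch below draws weights from a distribution with a point mass at zero (the spike) and a Gaussian mixture (the slab), then collapses each weight to either zero or one of K shared component means. This is not the authors' implementation: the function names, the parameterization (`incl_prob`, `mix_logits`), and the 0.5 pruning threshold are all assumptions made for the example.

```python
import numpy as np

def sample_spike_and_gmm(incl_prob, mix_logits, means, stds, rng):
    """Draw one sample per weight from a spike-and-GMM distribution.

    incl_prob  : (n,) assumed probability that a weight is kept (slab) vs. pruned (spike at 0)
    mix_logits : (n, K) assumed unnormalised per-weight scores over K shared components
    means, stds: (K,) shared component means and standard deviations
    """
    n, K = mix_logits.shape
    # Numerically stable softmax over the K components.
    z = mix_logits - mix_logits.max(axis=1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=1, keepdims=True)
    # Inverse-CDF sampling of one component index per weight.
    u = rng.random((n, 1))
    comp = np.minimum((u > np.cumsum(probs, axis=1)).sum(axis=1), K - 1)
    slab = rng.normal(means[comp], stds[comp])
    keep = rng.random(n) < incl_prob
    return np.where(keep, slab, 0.0), keep, comp

def compress(incl_prob, mix_logits, means, threshold=0.5):
    """Assumed deterministic rule: prune unlikely weights, snap the rest
    to the mean of their highest-scoring component."""
    keep = incl_prob >= threshold
    comp = mix_logits.argmax(axis=1)
    return np.where(keep, means[comp], 0.0)

rng = np.random.default_rng(0)
means = np.array([-0.5, -0.1, 0.1, 0.5])   # K = 4 levels, i.e. 2-bit codes
stds = np.full(4, 1e-3)
incl_prob = np.array([0.9, 0.1, 0.8, 0.2, 0.7])
mix_logits = rng.normal(size=(5, 4))

sampled, keep, comp = sample_spike_and_gmm(incl_prob, mix_logits, means, stds, rng)
quantized = compress(incl_prob, mix_logits, means)
print("sampled  :", sampled)
print("quantized:", quantized)
```

After `compress`, every weight is either exactly zero or equal to one of the K component means, so a kept weight can be stored as a log2(K)-bit index plus a sparsity mask. That is the sense in which sparsity and low-bit precision come from a single distribution in this sketch.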

Paper Structure

This paper contains 36 sections, 7 theorems, 87 equations, 8 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Let $r_n^* = \left((L+1)s^*/n\right)\log N + (s^*/n)\log\left(p\sqrt{n/s^*}\right)$, $\varepsilon_n^* = \sqrt{r_n^*}\,\log^\delta(n)$ for any $\delta > 1$, and $\xi_n^* = \inf_{\theta \in H(T, s, K),\ \left\|\theta\right\|\leq B} \left\|f_{\theta}-f_0\right\|_{\infty}^2$. Then, under mild conditions specified in t where $d(\cdot,\cdot)$ denotes the Hellinger distance, and $C$ and $C'$ are some constants.
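To get a feel for the quantities $r_n^*$ and $\varepsilon_n^*$ defined above, the following sketch evaluates them numerically. It reads the second term of $r_n^*$ as $(s^*/n)\log(p\sqrt{n/s^*})$ and treats $L$, $s^*$, $n$, $N$, and $p$ purely as positive numbers, since this summary does not define them. The values plugged in are hypothetical and not taken from the paper.

```python
import math

def rate(L, s_star, n, N, p):
    """r_n^* = ((L+1) s^*/n) log N + (s^*/n) log(p sqrt(n/s^*))."""
    first = ((L + 1) * s_star / n) * math.log(N)
    second = (s_star / n) * math.log(p * math.sqrt(n / s_star))
    return first + second

def eps(L, s_star, n, N, p, delta=1.5):
    """eps_n^* = sqrt(r_n^*) * log(n)^delta, for some delta > 1."""
    if delta <= 1:
        raise ValueError("delta must be greater than 1")
    return math.sqrt(rate(L, s_star, n, N, p)) * math.log(n) ** delta

# Hypothetical settings: only n varies, everything else is held fixed.
for n in (50_000, 500_000, 5_000_000):
    r = rate(L=4, s_star=1_000, n=n, N=512, p=3_072)
    e = eps(L=4, s_star=1_000, n=n, N=512, p=3_072)
    print(f"n={n:>9,d}  r_n*={r:.4f}  eps_n*={e:.4f}")
```

With the other quantities fixed, $r_n^*$ shrinks roughly like $s^*/n$ up to logarithmic factors, which is the behaviour one would expect from a consistency rate for a sparse network.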

Figures (8)

  • Figure 1: Our SQS method achieves high compression with minimal performance degradation by jointly pruning and quantizing model weights through variational learning. We employ a spike-and-GMM variational distribution to approximate full-precision weights: the spike component promotes sparsity for pruning, while the slab component (i.e., GMM) models a quantized weight distribution.
  • Figure 2:
  • Figure 3: Comparison of inference accuracy on the CIFAR-100 dataset using ResNet-18 (left) and ResNet-50 (right). Under the same number of Gaussian components, SQS with Bayesian averaging (Equation \ref{eq:infer-bam}) results in a smaller accuracy drop than the greedy approach (Equation \ref{eq:greedy}).
  • Figure 4: Weight distribution of different layers in Qwen2.5 model (part 1).
  • Figure 5: Weight distribution of different layers in Qwen2.5 model (part 2).
  • ...and 3 more figures

Theorems & Definitions (14)

  • Theorem 1
  • proof : Sketch of Proof
  • Lemma 2: From Lemma 6.1 in cherief2018consistency
  • proof
  • Definition 1
  • Theorem 3
  • proof
  • Lemma 4
  • Lemma 5
  • proof
  • ...and 4 more