Unified Stochastic Framework for Neural Network Quantization and Pruning
Haoyu Zhang, Rayan Saab
TL;DR
This work presents a unified stochastic framework for post-training neural network compression that covers both quantization and pruning. Building on Stochastic Path Following Quantization (SPFQ), it introduces a scaling parameter $C$ and a general stochastic operator $\mathcal{T}$ to enable robust low-bit quantization and sparsity, with rigorous high-probability error bounds. The theoretical results establish Gaussian-dominated bounds for the accumulated quantization/pruning error at each layer and extend naturally to one-bit quantization and to joint quantization with pruning. The framework thus provides provable guarantees for quantization, pruning, and their combination, offering a scalable path to hardware-friendly neural networks with controlled accuracy loss.
Abstract
Quantization and pruning are two essential techniques for compressing neural networks, yet they are often treated independently, with limited theoretical analysis connecting them. This paper introduces a unified framework for post-training quantization and pruning using stochastic path-following algorithms. Our approach builds on the Stochastic Path Following Quantization (SPFQ) method, extending its applicability to pruning and low-bit quantization, including challenging 1-bit regimes. By incorporating a scaling parameter and generalizing the stochastic operator, the proposed method achieves robust error correction and yields rigorous theoretical error bounds for both quantization and pruning as well as their combination.
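
To make the idea of a general stochastic operator concrete, here is a minimal sketch of two unbiased operators, one for quantization and one for pruning. It is an illustration only, not the paper's algorithm: the function names `stochastic_round` and `stochastic_prune` and the parameters `step` and `threshold` are hypothetical. The sketch also assumes that the property of interest is unbiasedness, i.e. that the operator returns its input in expectation, and it omits both the path-following error correction and the scaling parameter.

```python
import numpy as np


def stochastic_round(x, step, rng):
    """Round each entry of x to a multiple of `step` (hypothetical grid).

    An entry is rounded up with probability equal to its fractional
    position between the two neighboring grid points, so the output
    equals x in expectation.
    """
    scaled = np.asarray(x, dtype=float) / step
    lower = np.floor(scaled)
    prob_up = scaled - lower  # lies in [0, 1)
    round_up = rng.random(scaled.shape) < prob_up
    return (lower + round_up) * step


def stochastic_prune(x, threshold, rng):
    """Randomly zero out entries with magnitude below `threshold`.

    A small entry becomes sign(x) * threshold with probability
    |x| / threshold and 0 otherwise, which keeps the expectation equal
    to x. Entries with |x| >= threshold pass through unchanged.
    """
    x = np.asarray(x, dtype=float)
    small = np.abs(x) < threshold
    promote = rng.random(x.shape) < np.abs(x) / threshold
    pruned = np.where(promote, np.sign(x) * threshold, 0.0)
    return np.where(small, pruned, x)


rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=8)
quantized = stochastic_round(weights, step=0.05, rng=rng)
sparse = stochastic_prune(weights, threshold=0.05, rng=rng)
```

Each operator handles a single entry independently of the others. In a path-following scheme, such an operator would be applied to a weight after it has been adjusted for the error accumulated so far, so that later entries can compensate for earlier rounding or pruning decisions.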
