SONIQ: System-Optimized Noise-Injected Ultra-Low-Precision Quantization with Full-Precision Parity
Cyrus Zhou, Pedro Savarese, Zack Hassman, Vaughn Richard, Michael DiBrino, Michael Maire, Yanjing Li
TL;DR
The paper tackles the challenge of deploying ultra-low-precision neural networks on commodity hardware without sacrificing accuracy or requiring bespoke runtimes. It introduces SONIQ-QAT, a system-aware, noise-injected quantization-aware training method that learns per-channel mixed precision for weights and activations under hardware-calibrated perturbations, using a two-phase training process and a compact precision palette. SONIQ achieves up to 16x compression on CNNs and up to 7x on Transformers while matching or exceeding full-precision (FP) accuracy, and it delivers substantial end-to-end speedups on CPU and GPU vector units, which makes it practical for edge and data-center deployment. The work also reports empirical findings on precision palettes, notably that two-level, bimodal configurations are effective, and provides a concrete co-design pathway linking training-time quantization to deployment-time performance on mainstream hardware.
Abstract
Ultra-low-precision inference can sharply reduce memory and latency but often degrades accuracy and relies on specialized hardware. We present SONIQ, a system-optimized, noise-injected quantization framework that learns per-channel mixed precision for both weights and activations while training under the same rules used at inference. By injecting hardware-calibrated quantization noise during training, SONIQ steers models toward the discrete arithmetic used at deployment, without bespoke runtimes. Across CNNs and Transformers, SONIQ achieves up to 16x and 7x compression, respectively, while matching or exceeding full-precision accuracy. Measured end-to-end, SONIQ delivers up to 7.3x CPU speedup over strong INT8 baselines and up to 6.3x (vector units) / 2.8x (tensor cores) GPU speedup relative to FP16. A practical outcome is that two per-channel precision levels, one in the 1-to-4-bit range and one in the 4-to-8-bit range, suffice; at inference, each channel selects one of the two, keeping kernels simple and fast. To our knowledge, SONIQ is the first framework to reach or surpass full-precision accuracy under ultra-low (1 to 4 bits per parameter) regimes while remaining deployable on commodity hardware, narrowing the gap between quantization theory and practical, high-throughput inference.
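To make the two ideas in the abstract concrete (training under injected quantization noise, and a two-level per-channel precision palette), the following is a minimal sketch rather than SONIQ's actual method. It assumes a symmetric uniform quantizer per channel and uses plain uniform noise of one quantization step in width as a stand-in for the hardware-calibrated noise, which the abstract does not specify. All function names, the palette `(2, 8)`, and the FP32 baseline used for the compression ratio are hypothetical choices for illustration.

```python
import random


def channel_step(channel, bits):
    """Return (step, max_abs) for a symmetric uniform grid on [-max|w|, +max|w|]."""
    max_abs = max(abs(x) for x in channel)
    intervals = 2 ** bits - 1  # 2**bits grid points span this many intervals
    # Guard against an all-zero channel, which would give a zero step.
    return max(2.0 * max_abs / intervals, 1e-12), max_abs


def quantize_channel(channel, bits):
    """Inference-time rule: snap each value to the nearest of 2**bits grid points."""
    step, max_abs = channel_step(channel, bits)
    return [round((x + max_abs) / step) * step - max_abs for x in channel]


def inject_noise_channel(channel, bits, rng):
    """Training-time surrogate (assumed): add uniform noise in [-step/2, +step/2]."""
    step, _ = channel_step(channel, bits)
    return [x + rng.uniform(-0.5, 0.5) * step for x in channel]


def average_bits(bits_per_channel):
    """Mean bit width, assuming every channel holds the same number of values."""
    return sum(bits_per_channel) / len(bits_per_channel)


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(4)]
    # Hypothetical two-level palette: each channel picks either 2 or 8 bits.
    assignment = [2, 2, 2, 8]
    noisy = [inject_noise_channel(w, b, rng) for w, b in zip(weights, assignment)]
    quantized = [quantize_channel(w, b) for w, b in zip(weights, assignment)]
    avg = average_bits(assignment)
    # Compression relative to an assumed 32-bit floating-point baseline.
    print("average bits:", avg, "compression vs FP32:", 32.0 / avg)
```

In this sketch the injected noise has the same bound as the rounding error of the quantizer (half a step per value), so a model trained on the noisy weights sees perturbations of the magnitude it will meet at inference. Because each channel uses exactly one of two bit widths, the average bit width, and hence the compression ratio, follows directly from how many channels pick the lower level.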
