SONIQ: System-Optimized Noise-Injected Ultra-Low-Precision Quantization with Full-Precision Parity

Cyrus Zhou, Pedro Savarese, Zack Hassman, Vaughn Richard, Michael DiBrino, Michael Maire, Yanjing Li

TL;DR

The paper addresses how to deploy ultra-low-precision neural networks on commodity hardware without sacrificing accuracy or requiring bespoke runtimes. It introduces SONIQ-QAT, a system-aware, noise-injected quantization-aware training method that learns per-channel mixed precision for weights and activations under hardware-calibrated perturbations, organized around a two-phase training process and a compact precision palette. SONIQ achieves up to 16× compression on CNNs and up to 7× on Transformers while matching or exceeding full-precision (FP) accuracy, and it delivers end-to-end speedups of up to 7.3× on CPU over INT8 baselines and up to 6.3× on GPU vector units relative to FP16, which makes it practical for both edge and data-center deployment. The work also reports empirical findings on precision palettes, notably that two-level, bimodal configurations are sufficient, and provides a concrete co-design pathway linking training-time quantization to deployment-time performance on mainstream hardware.

Abstract

Ultra-low-precision inference can sharply reduce memory and latency but often degrades accuracy and relies on specialized hardware. We present SONIQ, a system-optimized, noise-injected quantization framework that learns per-channel mixed precision for both weights and activations while training under the same rules used at inference. By injecting hardware-calibrated quantization noise during training, SONIQ steers models toward the discrete arithmetic used at deployment -- without bespoke runtimes. Across CNNs and Transformers, SONIQ achieves up to 16x and 7x compression, respectively, while matching or exceeding full-precision accuracy. Measured end-to-end, SONIQ delivers up to 7.3x CPU speedup over strong INT8 baselines and up to 6.3x (vector units) / 2.8x (tensor cores) GPU speedup relative to FP16. A practical outcome is that two per-channel precision levels -- one in the 1--4-bit range and one in the 4--8-bit range -- suffice in practice; at inference, each channel selects one of the two, keeping kernels simple and fast. To our knowledge, SONIQ is the first framework to reach or surpass full-precision accuracy under ultra-low (1--4 bits per parameter) regimes while remaining deployable on commodity hardware, narrowing the gap between quantization theory and practical, high-throughput inference.
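
As a rough illustration of how a two-level per-channel palette translates into a compression ratio, the sketch below computes the average bits per parameter and the ratio against a full-precision baseline. The function name `compression_ratio`, the channel counts, the chosen bit-widths (1 and 5), and the 32-bit baseline are assumptions made for illustration, not values taken from the paper. The sketch also ignores storage overhead such as per-channel scales.

```python
def compression_ratio(channel_bits, params_per_channel, baseline_bits=32):
    """Return (average bits per parameter, compression relative to baseline).

    channel_bits:       bit-width assigned to each channel (hypothetical values)
    params_per_channel: number of parameters in each channel
    baseline_bits:      assumed full-precision width (32 here, an assumption)
    """
    if len(channel_bits) != len(params_per_channel):
        raise ValueError("one bit-width is required per channel")
    total_params = sum(params_per_channel)
    total_bits = sum(b * n for b, n in zip(channel_bits, params_per_channel))
    avg_bits = total_bits / total_params
    return avg_bits, baseline_bits / avg_bits


# Hypothetical layer: 6 channels at the low level (1 bit) and
# 2 channels at the high level (5 bits), 100 parameters per channel.
bits = [1] * 6 + [5] * 2
avg, ratio = compression_ratio(bits, [100] * 8)
print(avg, ratio)  # 2.0 16.0
```

Under these assumed numbers, an average of 2 bits per parameter corresponds to 16× compression against a 32-bit baseline. Which baseline the paper's reported ratios use is not stated in this excerpt.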

Paper Structure (46 sections, 11 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: End-to-end SONIQ-QAT workflow. From left to right: (1) enumerate the bit-widths natively supported by the target accelerator and assemble them into a precision palette; (2) use a temperature-annealed soft-assignment to choose a palette entry for every channel; (3) quantize weights and activations with hardware-calibrated noise drawn from the selected precisions, allowing joint optimization of parameters and bit-widths during back-propagation.
  • Figure 2: SONIQ-QAT inference-aware normalization: group channels by bit-width into uniform-precision blocks, then pad each block to the nearest SIMD width to maximize vector utilization without changing results.
  • Figure 3: Average weight–activation precision vs. accuracy for networks with k = 2 and k = 3 precision levels; k = 2 already achieves the optimal accuracy–compression tradeoff.
  • Figure 4: Accuracy vs. bits/param for SONIQ with fixed palettes. A single $>$4-bit option preserves full-precision accuracy -- power-of-two palettes incur negligible loss.
  • Figure 5: Inference latency breakdown of the transformer trained with SONIQ. Here, $bs$ denotes batch size and $sl$ the (equal) source- and target-sequence length. We apply weight–activation quantization to “Linear” (all non-attention matmul layers except the final projection) and weight-only quantization to the final dense layer; greyed operators remain unmodified.
  • ...and 4 more figures
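
The grouping-and-padding step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `group_and_pad` is hypothetical, the SIMD width is counted in channels (a simplification), "nearest SIMD width" is interpreted as rounding up to the next multiple, and padding slots are represented by `None`.

```python
from collections import defaultdict


def group_and_pad(channel_bits, simd_width):
    """Group channel indices by bit-width, then pad each group up to a
    multiple of simd_width.

    Padding slots are marked with None. A kernel could treat them as
    all-zero channels whose outputs are discarded, so results do not change.
    """
    if simd_width <= 0:
        raise ValueError("simd_width must be positive")

    # Collect the original channel indices that share each bit-width.
    groups = defaultdict(list)
    for idx, bits in enumerate(channel_bits):
        groups[bits].append(idx)

    # Round each uniform-precision block up to the next multiple of simd_width.
    blocks = {}
    for bits in sorted(groups):
        members = groups[bits]
        pad = (-len(members)) % simd_width
        blocks[bits] = members + [None] * pad
    return blocks


# Hypothetical 7-channel layer with a two-level palette {2, 8}.
blocks = group_and_pad([2, 8, 2, 2, 8, 2, 2], simd_width=4)
print(blocks[2])  # [0, 2, 3, 5, 6, None, None, None]
print(blocks[8])  # [1, 4, None, None]
```

Keeping the original channel indices in each block lets the outputs be scattered back to their original order after the uniform-precision kernels run.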