Robust Training of Neural Networks at Arbitrary Precision and Sparsity

Chengxi Ye, Grace Chu, Yanfeng Liu, Yichi Zhang, Lukasz Lew, Li Zhang, Mark Sandler, Andrew Howard

TL;DR

The unified framework trains models at arbitrary precisions and sparsity levels with off-the-shelf recipes, enabling stable A1W1 and sub-1-bit networks where others falter and providing a theoretically grounded path to hyper-efficient neural networks.

Abstract

The discontinuous operations inherent in quantization and sparsification introduce a long-standing obstacle to backpropagation, particularly in ultra-low precision and sparse regimes. While the community has long viewed quantization as unfriendly to gradient descent due to its lack of smoothness, we pinpoint, for the first time, that the key issue is the absence of a proper gradient path that allows training to learn robustness to quantization noise. The standard Straight-Through Estimator (STE) exacerbates this with its well-understood mismatch: a quantization-aware forward pass but an oblivious backward pass, leading to unmanaged error and instability. We solve this by explicitly modeling quantization as additive noise, making the full forward-backward path well-defined without heuristic gradient estimation. As one natural solution, we introduce a denoising dequantization transform derived from a principled ridge regression objective, creating an explicit, corrective gradient path that makes learning robust to the noise that STE bypasses. We extend this to sparsification by treating it as a special form of quantization that zeros out small values. Our unified framework trains models at arbitrary precisions and sparsity levels with off-the-shelf recipes, enabling stable A1W1 and sub-1-bit networks where others falter. It yields state-of-the-art results, mapping efficiency frontiers for modern LLMs and providing a theoretically grounded path to hyper-efficient neural networks.
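
The additive-noise view and the ridge-regression dequantization described above can be illustrated with a small forward-only sketch. This is an assumption-laden reading of the abstract, not the paper's implementation: the function names `quantize_sign` and `ridge_dequantize`, the per-tensor affine form `a * q + b`, and the regularizer `lam` are all hypothetical choices made for illustration. The sketch only shows that a ridge-fitted affine correction reconstructs the input better than the raw 1-bit codes; in an autodiff framework, the dependence of `a` and `b` on the input would be one possible source of the corrective gradient path the abstract mentions.

```python
import numpy as np

def quantize_sign(x):
    """1-bit quantizer (illustrative): map every element to -1 or +1."""
    return np.where(x >= 0, 1.0, -1.0)

def ridge_dequantize(x, q, lam=1e-6):
    """Fit x ~ a * q + b by ridge-regularized least squares (hypothetical helper).

    a = cov(q, x) / (var(q) + lam),  b = mean(x) - a * mean(q)
    """
    x_mean, q_mean = x.mean(), q.mean()
    cov = np.mean((q - q_mean) * (x - x_mean))
    var = np.mean((q - q_mean) ** 2)
    a = cov / (var + lam)
    b = x_mean - a * q_mean
    return a * q + b, a, b

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)

q = quantize_sign(x)
noise = q - x                      # additive-noise view: q = x + noise

x_hat, a, b = ridge_dequantize(x, q)

mse_raw = np.mean((q - x) ** 2)            # error of the raw 1-bit codes
mse_denoised = np.mean((x_hat - x) ** 2)   # error after the affine correction
print(f"a={a:.3f}, b={b:.3f}, raw MSE={mse_raw:.3f}, denoised MSE={mse_denoised:.3f}")
```

Because the identity map (`a = 1`, `b = 0`) is one of the candidates the least-squares fit can choose, the fitted reconstruction cannot be meaningfully worse than the raw codes when `lam` is small.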

Paper Structure

This paper contains 54 sections, 1 theorem, 14 equations, 9 figures, and 13 tables.

Key Result

Theorem 1

The result of a two-sided, channel-wise affine dequantization, $\tilde{Y}$, can be expressed as: where $n$ is the inner dimension size, the bar notation $\overline{(\cdot)}$ denotes the mean, variables with $X$ are column vectors (row-wise statistics), and variables with $W$ are column vectors (column-wise statistics) that are transposed where appropriate.

Figures (9)

  • Figure 1: Training Stability and Quantization Robustness Analysis. (a) Comparison of training loss on the Shakespeare dataset with 1-bit weights and activations (A1W1). Standard STE and BitNet fail to stabilize in this extreme regime, exhibiting divergence or high loss. In contrast, our approach converges smoothly, matching the stability of higher-precision baselines. (b) Comparison of Linear vs. Affine quantization schemes at A1W1. Standard STE (right bars) fails to utilize the additional expressivity of affine parameters, showing no improvement over linear quantization. Our method (left bars) robustly learns the affine parameters (scale and bias), achieving a significant accuracy jump (see Table \ref{tab:rq_vs_ste}/Appendix \ref{sec:affine_vs_linear}).
  • Figure 2: Storage Efficiency Frontiers: We map the trade-off between validation accuracy and effective bits-per-element (BPE). The frontier reveals that asymmetric quantization (e.g., A4W1) provides a superior storage-accuracy trade-off compared to symmetric settings (e.g., A2W2), as it preserves activation information while aggressively compressing static weights.
  • Figure 3: Approximate Energy Efficiency Frontier: Estimating compute cost (Activation Bits $\times$ Weight Bits $\times$ Sparsity). The results demonstrate a synergy between our estimator and structured sparsity: the (1:4 and 2:4) structured sparse A4W1 models reduce the compute cost of the dense equivalent while maintaining high accuracy, defining the efficiency frontier.
  • Figure 4: (a) Storage vs. Accuracy comparison between Gemma3 1B and 4B models. The quantized 4B model achieves higher accuracy than both BF16 and quantized versions of the 1B model. (b) Total Energy Cost vs. Accuracy. The quantized and sparse 4B model is both more accurate and more computationally efficient than a quantized 1B model.
  • Figure 5: Comparison of our denoising reconstruction method against STE across all experiment configurations, sorted by our method's performance. Our approach consistently yields higher accuracy, and the improvement is most pronounced at lower bit-widths. Notably, in the A1.5W1.5 channel-wise setting, STE fails to converge entirely.
  • ...and 4 more figures
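
The compute-cost proxy named in the Figure 3 caption (Activation Bits × Weight Bits × Sparsity) and the N:M structured sparsity patterns it mentions can be sketched as follows. Several points here are assumptions rather than details confirmed by the source: the "Sparsity" factor is read as the fraction of weights kept (0.5 for 2:4, 0.25 for 1:4), the kept weights are chosen by magnitude (in line with the abstract's description of sparsification as zeroing out small values), and the names `nm_sparsify` and `compute_cost_proxy` are hypothetical.

```python
import numpy as np

def nm_sparsify(w, n=2, m=4):
    """Keep the n largest-magnitude entries in each group of m; zero the rest.

    Assumes w.size is divisible by m. Groups are taken over the flattened array.
    """
    groups = w.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    out = groups.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out.reshape(w.shape)

def compute_cost_proxy(act_bits, weight_bits, n=None, m=None):
    """Activation bits x weight bits x fraction of weights kept (assumed reading)."""
    density = 1.0 if n is None else n / m
    return act_bits * weight_bits * density

w = np.array([0.1, -2.0, 0.3, 1.5, -0.2, 0.05, 4.0, -3.0])
w_sparse = nm_sparsify(w, n=2, m=4)

dense_cost = compute_cost_proxy(4, 1)             # dense A4W1
cost_2_4 = compute_cost_proxy(4, 1, n=2, m=4)     # 2:4 sparse A4W1
cost_1_4 = compute_cost_proxy(4, 1, n=1, m=4)     # 1:4 sparse A4W1
print(w_sparse, dense_cost, cost_2_4, cost_1_4)
```

Under this reading, a 2:4 sparse A4W1 layer costs half as much as its dense equivalent and a 1:4 sparse layer a quarter, which matches the caption's statement that the structured sparse models reduce the compute cost of the dense model.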

Theorems & Definitions (2)

  • Theorem 1
  • proof