The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning
Henry Han, Xiyang Liu, Xiaodong Wang, Fei Han, Xiaodong Li
TL;DR
This work shows that the long-standing linear scaling law for precision reduction fails in multi-hop reasoning because of two intertwined forces: hardware casting overhead and the sequential nature of reasoning, which propagates quantization noise from hop to hop. It formalizes the Sustainability Index (SI) to jointly measure Trust, Economic Efficiency, and Environmental impact, and proves that a Quantization Trap is structurally inevitable when batch size is small and per-hop casting dominates. In an empirical study of Mistral-7B and Qwen-3-0.6B on the GSM8K and MathQA benchmarks across L4, A100, and H100 GPUs, the authors demonstrate non-monotonic energy and accuracy behavior under 8- and 4-bit quantization, including a pronounced energy spike and trust degradation. They also derive a Sequential Amortization Failure theorem and identify a Critical Batch Threshold that governs whether low-bit quantization is beneficial, showing that hardware improvements alone cannot restore the traditional scaling law for complex reasoning tasks. The findings argue for precision-aware scaling and mitigation strategies beyond brute-force compression, with implications for the design of future AI hardware and evaluation frameworks.
Abstract
Neural scaling laws provide a predictable recipe for AI advancement: reducing numerical precision should linearly improve computational efficiency and energy profile (E proportional to bits). In this paper, we demonstrate that this scaling law breaks in the context of multi-hop reasoning. We reveal a "quantization trap" in which reducing precision from 16-bit to 8/4-bit paradoxically increases net energy consumption while degrading reasoning accuracy. We provide a rigorous theoretical decomposition that attributes this failure to hardware casting overhead (the hidden latency cost of dequantization kernels, which becomes a dominant bottleneck in sequential reasoning chains) and to a sequential energy amortization failure. As a result, breaking the scaling law is unavoidable in practice. Our findings suggest that the industry's "smaller-is-better" heuristic is mathematically counterproductive for complex reasoning tasks.
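To make the amortization argument concrete, the following is a minimal toy sketch, not the paper's actual decomposition. It assumes that per-hop compute energy scales linearly with bit width and that each hop of a quantized model pays a fixed casting cost shared across the batch. The function names, the cost constants, and the closed-form threshold are all hypothetical and chosen only for illustration.

```python
def energy_per_sample(bits, hops, batch_size,
                      compute_cost_per_bit=1.0, cast_cost_per_hop=0.0):
    """Toy per-sample energy for a reasoning chain of `hops` sequential steps.

    Assumption (illustrative only): compute energy is linear in `bits`,
    and a fixed casting cost per hop is amortized over the batch.
    """
    compute = compute_cost_per_bit * bits
    casting = cast_cost_per_hop / batch_size
    return hops * (compute + casting)


def critical_batch_threshold(low_bits, cast_cost_per_hop,
                             compute_cost_per_bit=1.0, base_bits=16):
    """Batch size at which the toy low-bit energy equals the 16-bit energy.

    Below this batch size the casting cost outweighs the compute savings
    in this toy model; above it, low precision is cheaper.
    """
    savings_per_hop = compute_cost_per_bit * (base_bits - low_bits)
    return cast_cost_per_hop / savings_per_hop


if __name__ == "__main__":
    hops = 5
    cast = 64.0  # hypothetical casting cost per hop, in arbitrary units
    baseline = energy_per_sample(16, hops, batch_size=1)
    for batch in (1, 8, 32):
        low = energy_per_sample(8, hops, batch, cast_cost_per_hop=cast)
        print(f"batch={batch:>2}  8-bit={low:.1f}  16-bit={baseline:.1f}")
    print("toy threshold:", critical_batch_threshold(8, cast))
```

Under these made-up constants, the 8-bit configuration costs more than the 16-bit baseline at batch size 1 and only becomes cheaper once the batch exceeds the toy threshold, which mirrors the qualitative behavior described above. Because the number of hops multiplies both terms, lengthening the chain does not help amortize the casting cost in this model.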
