The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning

Henry Han, Xiyang Liu, Xiaodong Wang, Fei Han, Xiaodong Li

TL;DR

This work shows that the long-standing linear scaling law for precision reduction fails in multi-hop reasoning because of two intertwined forces: hardware casting overhead, and the sequential structure of reasoning, which propagates quantization noise from hop to hop. It formalizes the Sustainability Index (SI) to jointly measure Trust, Economic Efficiency, and Environmental impact, and proves that a Quantization Trap is structurally inevitable when batch size is small and per-hop casting dominates. In an empirical study of Mistral-7B and Qwen3-0.6B on GSM8K and MathQA across L4, A100, and H100 GPUs, the authors observe non-monotonic energy and accuracy behavior under 8-bit and 4-bit quantization, including a pronounced energy spike and trust degradation. They also derive a Sequential Amortization Failure theorem and identify a Critical Batch Threshold that governs whether low-bit quantization is beneficial, and they argue that hardware improvements alone cannot restore the traditional scaling law for complex reasoning tasks. The findings call for precision-aware scaling and mitigation strategies beyond brute-force compression, with implications for the design of future AI hardware and evaluation frameworks.

Abstract

Neural scaling laws provide a predictable recipe for AI advancement: reducing numerical precision should linearly improve computational efficiency and energy profile (E proportional to bits). In this paper, we demonstrate that this scaling law breaks in the context of multi-hop reasoning. We reveal a 'quantization trap' in which reducing precision from 16-bit to 8-bit or 4-bit paradoxically increases net energy consumption while degrading reasoning accuracy. We provide a rigorous theoretical decomposition that attributes this failure to two causes: hardware casting overhead (the hidden latency cost of dequantization kernels, which becomes a dominant bottleneck in sequential reasoning chains) and a sequential energy amortization failure. As a result, the breakdown of the scaling law is unavoidable in practice. Our findings suggest that the industry's "smaller-is-better" heuristic is mathematically counterproductive for complex reasoning tasks.

Paper Structure (21 sections, 12 theorems, 28 equations, 4 figures)

Key Result

Proposition 3.2

Let $SI(\theta)$ be a function of precision (bit-width) $p$ for a configuration $\theta \in \Theta$. A Quantization Trap is identified when $\frac{\partial SI}{\partial p} > 0$, that is, when SI falls as precision is reduced, signifying a fundamental breakdown of the linear scaling law. (proof in appendix)
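
As a rough illustration of the condition in Proposition 3.2, the sketch below approximates $\frac{\partial SI}{\partial p}$ by finite differences between adjacent measured bit-widths and flags every interval where SI rises with precision. The function name `find_quantization_traps` and the SI values are hypothetical placeholders, not the paper's code or results; the paper's actual SI formula is not reproduced here.

```python
from typing import Dict, List, Tuple


def find_quantization_traps(si_by_bits: Dict[int, float]) -> List[Tuple[int, int, float]]:
    """Flag adjacent bit-width pairs where SI increases with precision.

    Returns (low_bits, high_bits, slope) for each pair whose finite-difference
    slope d(SI)/d(p) is positive, i.e. where reducing precision from
    high_bits to low_bits lowers SI.
    """
    bits = sorted(si_by_bits)
    traps = []
    for low, high in zip(bits, bits[1:]):
        # Finite-difference stand-in for the partial derivative dSI/dp.
        slope = (si_by_bits[high] - si_by_bits[low]) / (high - low)
        if slope > 0:
            traps.append((low, high, slope))
    return traps


# Hypothetical SI values keyed by bit-width (assumed for illustration only).
si_measurements = {4: 0.60, 8: 0.55, 16: 0.90}
traps = find_quantization_traps(si_measurements)
for low, high, slope in traps:
    print(f"SI rises from {low}-bit to {high}-bit (slope {slope:.4f}): trap condition met")
```

With only three measured bit-widths, a finite difference is a coarse proxy for the derivative; it can show that the trend is non-monotonic but says nothing about behavior between the measured points.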

Figures (4)

  • Figure 1: Sustainability Inversion (Mistral-7B, GSM8K). (A) Physical Telemetry: Reducing precision to 8/4-bit across L4/A100/H100 triggers non-monotonic throughput collapse and energy spikes; FP16 is more efficient due to software-emulated casting overhead. (B) Sustainability Manifold: Indices ($T_{SI}, E_{SI}, S_{SI}$) show low-bit configurations are Pareto-dominated. 8-bit represents systemic failure, while 4-bit marks a "Quantization Trap" with a 31.1% global SI deficit relative to the FP16 anchor.
  • Figure 2: The Size Paradox (Qwen3-0.6B). (A) 4-bit quantization triggers universal reasoning collapse regardless of architecture. (B) High 8-bit COR values (2.5–2.8$\times$) prove casting overhead dominance is the mechanical driver of inefficiency. (C) Telemetry reveals that low-precision "optimization" paradoxically results in a 400% energy penalty.
  • Figure 3: Sustainability Inversion of Qwen3-0.6B reasoning on MathQA. (a–c) Physical telemetry reveals throughput collapse and energy spikes; FP16 is $4.2\times$ more efficient than 8-bit on H100. (d–f) Sustainability indices expose systemic failure ($SI \approx 0.55$) in quantized models due to unamortized casting overhead. FP16 strictly Pareto-dominates all configurations across L4, A100, and H100, proving brute-force bit-reduction constitutes a Quantization Trap.
  • Figure 4: Cross-Architectural Trap Evidence. (A) Throughput analysis locates $B^* \approx 64$ for Falcon3, while Mistral-7B remains terminally trapped ($B^* > 128$). (B) $COR > 1.0$ identifies Casting Dominance as the mechanical bottleneck. (C) A scale-invariant $\sim$30% logic collapse proves that batch-driven efficiency gains fail to restore reasoning trust.
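
The throughput analysis in Figure 4(A) can be mimicked with a small helper. This is a minimal sketch that assumes $B^*$ can be read operationally as the smallest measured batch size at which quantized throughput matches or exceeds FP16 throughput; the paper's formal definition of the Critical Batch Threshold is not given in this summary. The function name and all throughput numbers are hypothetical.

```python
from typing import Dict, Optional


def critical_batch_threshold(
    fp16_tps: Dict[int, float], quant_tps: Dict[int, float]
) -> Optional[int]:
    """Return the smallest shared batch size where quantized throughput >= FP16.

    Returns None when the quantized configuration never catches up within the
    measured range, i.e. the threshold lies beyond the largest batch tested.
    """
    for batch in sorted(set(fp16_tps) & set(quant_tps)):
        if quant_tps[batch] >= fp16_tps[batch]:
            return batch
    return None


# Hypothetical tokens-per-second measurements keyed by batch size.
fp16 = {1: 40.0, 16: 400.0, 64: 1200.0, 128: 1800.0}
recovers = {1: 15.0, 16: 220.0, 64: 1250.0, 128: 2100.0}
trapped = {1: 12.0, 16: 150.0, 64: 600.0, 128: 1100.0}

b_star_recovers = critical_batch_threshold(fp16, recovers)
b_star_trapped = critical_batch_threshold(fp16, trapped)
print(b_star_recovers, b_star_trapped)
```

A `None` result corresponds to the "terminally trapped" case in the caption, where the crossover lies above the largest batch size measured. Note that a throughput crossover alone does not address the accuracy loss reported in panel (C).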

Theorems & Definitions (19)

  • Remark 3.1
  • Proposition 3.2: Scaling Law Divergence
  • Lemma 4.1: Average Latency per hop
  • Proposition 4.2: $COR$ Approximation
  • Theorem 4.3: Sequential Amortization Failure in multi-hop reasoning
  • Theorem 4.4: Scaling Law Divergence in Multi-Hop Reasoning
  • Theorem 4.5: Amortization-Trust Decoupling
  • Proposition A.1: Scaling Law Divergence
  • proof
  • Lemma A.2: Average Latency per hop
  • ...and 9 more