The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning

Henry Han; Xiyang Liu; Xiaodong Wang; Fei Han; Xiaodong Li

TL;DR

This work shows that the long-standing linear scaling laws for precision reduction fail in multi-hop reasoning because of two intertwined forces: hardware casting overhead, and the sequential nature of reasoning, which propagates quantization noise. It formalizes the Sustainability Index (SI) to jointly measure Trust, Economic Efficiency, and Environmental impact, and proves that a Quantization Trap is structurally inevitable when batch size is small and per-hop casting dominates. In an empirical study of Mistral-7B and Qwen3-0.6B on GSM8K and MathQA across L4, A100, and H100 GPUs, the authors demonstrate non-monotonic energy and accuracy behavior under 8- and 4-bit quantization, including a pronounced energy spike and trust degradation. They also derive a Sequential Amortization Failure theorem and identify a Critical Batch Threshold that governs whether low-bit quantization is beneficial, highlighting that hardware improvements alone cannot restore the traditional scaling law for complex reasoning tasks. The findings argue for precision-aware scaling and mitigation strategies beyond brute-force compression, with implications for the design of future AI hardware and evaluation frameworks.

Abstract

Neural scaling laws provide a predictable recipe for AI advancement: reducing numerical precision should linearly improve computational efficiency and energy profile ($E \propto \text{bits}$). In this paper, we demonstrate that this scaling law breaks in the context of multi-hop reasoning. We reveal a 'quantization trap' in which reducing precision from 16-bit to 8/4-bit paradoxically increases net energy consumption while degrading reasoning accuracy. We provide a rigorous theoretical decomposition that attributes this failure to hardware casting overhead (the hidden latency cost of dequantization kernels), which becomes a dominant bottleneck in sequential reasoning chains, and to a sequential energy amortization failure. As a result, the breakdown of the scaling law is unavoidable in practice. Our findings suggest that the industry's "smaller-is-better" heuristic is mathematically counterproductive for complex reasoning tasks.
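Read literally, the proportionality stated in the abstract fixes what the linear law would predict for each bit-width relative to the 16-bit baseline. The block below is a worked reading of that hypothesis only; $E(p)$ is assumed notation for energy at bit-width $p$ and is not taken from the paper.

```latex
\[
E(p) \propto p
\;\Longrightarrow\;
\frac{E(p)}{E(16)} = \frac{p}{16},
\qquad
\frac{E(8)}{E(16)} = \frac{1}{2},
\qquad
\frac{E(4)}{E(16)} = \frac{1}{4}.
\]
```

The quantization trap described in the abstract corresponds to these ratios exceeding 1 in practice, rather than landing near the predicted values.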

Paper Structure (21 sections, 12 theorems, 28 equations, 4 figures)

Introduction
Blind spots in LLM scaling laws
The Sustainability Index (SI) Framework
Mathematical Axiomatization of the Pillars
I. The Trust Sustainability ($T_{SI}$).
II. Economic Sustainability ($E_{SI}$)
III. The Energy Pillar ($S_{SI}$).
The Sustainability Manifold: Aggregation and Scaling Laws
The Scaling Monotonicity Hypothesis
Scaling Law Breaking in reasoning
Multi-Hop Reasoning and casting overhead
Logical vs. Atomic Hops.
Casting Overhead Ratio (COR):
Evaluation Models, Data and Hardware
Sustainability Inversion: the Quantization Trap
...and 6 more sections

Key Result

Proposition 3.2

Let $SI(\theta)$ denote the Sustainability Index of a configuration $\theta \in \Theta$, viewed as a function of precision (bit-width) $p$. A Quantization Trap is identified when $\frac{\partial SI}{\partial p} > 0$, signifying a fundamental breakdown of the linear scaling law (proof in appendix).
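The trap condition can be checked on measured data with a finite-difference stand-in for the derivative. The sketch below is a minimal illustration, not the authors' procedure: the function name `trap_intervals` is hypothetical, the SI values are made up for demonstration, and it assumes SI has already been computed for each bit-width by whatever formula the paper defines.

```python
from typing import Dict, List, Tuple


def trap_intervals(si_by_bits: Dict[int, float]) -> List[Tuple[int, int, float]]:
    """Flag adjacent bit-width pairs where SI rises with precision.

    For each neighbouring pair (lo, hi) of measured bit-widths, the slope
    (SI(hi) - SI(lo)) / (hi - lo) approximates dSI/dp. A positive slope
    means dropping from hi to lo bits lowered SI on that interval.
    """
    bits = sorted(si_by_bits)
    flagged = []
    for lo, hi in zip(bits, bits[1:]):
        slope = (si_by_bits[hi] - si_by_bits[lo]) / (hi - lo)
        if slope > 0:
            flagged.append((lo, hi, slope))
    return flagged


# Made-up SI values keyed by bit-width (illustration only, not paper data).
example_si = {4: 0.70, 8: 0.60, 16: 1.00}

for lo, hi, slope in trap_intervals(example_si):
    print(f"dSI/dp > 0 between {lo}-bit and {hi}-bit (slope {slope:.3f})")
```

With only three measured precisions the slope is a coarse approximation, so a flagged interval is a prompt to inspect the underlying telemetry rather than a proof of the proposition.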

Figures (4)

Figure 1: Sustainability Inversion (Mistral-7B, GSM8K). (A) Physical Telemetry: Reducing precision to 8/4-bit across L4/A100/H100 triggers non-monotonic throughput collapse and energy spikes; FP16 is more efficient because the low-bit paths incur software-emulated casting overhead. (B) Sustainability Manifold: Indices ($T_{SI}, E_{SI}, S_{SI}$) show low-bit configurations are Pareto-dominated. 8-bit represents systemic failure, while 4-bit marks a "Quantization Trap" with a 31.1% global SI deficit relative to the FP16 anchor.
Figure 2: The Size Paradox (Qwen3-0.6B). (A) 4-bit quantization triggers universal reasoning collapse regardless of architecture. (B) High 8-bit COR values (2.5–2.8$\times$) prove casting overhead dominance is the mechanical driver of inefficiency. (C) Telemetry reveals that low-precision "optimization" paradoxically results in a 400% energy penalty.
Figure 3: Sustainability Inversion of Qwen3-0.6B reasoning on MathQA. (a–c) Physical telemetry reveals throughput collapse and energy spikes; FP16 is $4.2\times$ more efficient than 8-bit on H100. (d–f) Sustainability indices expose systemic failure ($SI \approx 0.55$) in quantized models due to unamortized casting overhead. FP16 strictly Pareto-dominates all configurations across L4, A100, and H100, proving brute-force bit-reduction constitutes a Quantization Trap.
Figure 4: Cross-Architectural Trap Evidence. (A) Throughput analysis locates $B^* \approx 64$ for Falcon3, while Mistral-7B remains terminally trapped ($B^* > 128$). (B) $COR > 1.0$ identifies Casting Dominance as the mechanical bottleneck. (C) A scale-invariant $\sim$30% logic collapse proves that batch-driven efficiency gains fail to restore reasoning trust.

Theorems & Definitions (19)

Remark 3.1
Proposition 3.2: Scaling Law Divergence
Lemma 4.1: Average Latency per hop
Proposition 4.2: $COR$ Approximation
Theorem 4.3: Sequential Amortization Failure in multi-hop reasoning
Theorem 4.4: Scaling Law Divergence in Multi-Hop Reasoning
Theorem 4.5: Amortization-Trust Decoupling
Proposition A.1: Scaling Law Divergence
proof
Lemma A.2: Average Latency per hop
...and 9 more
