The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training

Hengjie Cao; Zhendong Huang; Mengyi Chen; Yifeng Yang; Fanqi Yu; Ruijun Huang; Fang Dong; Xin Zhang; Jixian Zhou; Anrui Chen; Mingzhi Dong; Yujiang Wang; Jinlong Hou; Qin Lv; Yuan Cheng; Tun Lu; Fan Yang; Li Shang

TL;DR

Empirical results on FP4 (W4A4G4) training show that mean removal substantially narrows the loss gap to BF16 and restores downstream performance, providing a hardware-efficient path to stable low-bit LLM training.

Abstract

Large language models trained on natural language exhibit pronounced anisotropy: a small number of directions concentrate disproportionate energy, while the remaining dimensions form a broad semantic tail. In low-bit training regimes, this geometry becomes numerically unstable. Because blockwise quantization scales are determined by extreme elementwise magnitudes, dominant directions stretch the dynamic range, compressing long-tail semantic variation into narrow numerical bins. We show that this instability is primarily driven by a coherent rank-one mean bias, which constitutes the dominant component of spectral anisotropy in LLM representations. This mean component emerges systematically across layers and training stages and accounts for the majority of extreme activation magnitudes, making it the principal driver of dynamic-range inflation under low precision. Crucially, because the dominant instability is rank-one, it can be eliminated through a simple source-level mean-subtraction operation. This bias-centric conditioning recovers most of the stability benefits of SVD-based spectral methods while requiring only reduction operations and standard quantization kernels. Empirical results on FP4 (W4A4G4) training show that mean removal substantially narrows the loss gap to BF16 and restores downstream performance, providing a hardware-efficient path to stable low-bit LLM training.
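The effect described in the abstract can be illustrated with a small, self-contained sketch. It is not the paper's implementation: the synthetic data, the block size of 16, the FP4 E2M1 magnitude grid, and the helper names (quantize_block, quantize_rows, column_means, demo) are all assumptions chosen for illustration, and the mean is simply added back in full precision rather than routed through whatever mean path the method actually uses. The sketch compares the error of blockwise absmax fake quantization on raw activations against the error when only the mean-subtracted residual is quantized.

```python
import random

# Non-negative magnitudes of an FP4 E2M1 grid (assumed format; sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Fake-quantize one block, scaling so the largest magnitude maps to 6."""
    absmax = max(abs(v) for v in block)
    if absmax == 0.0:
        return list(block)
    scale = absmax / FP4_GRID[-1]
    out = []
    for v in block:
        # Round the scaled magnitude to the nearest grid point.
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def quantize_rows(x, block_size=16):
    """Apply blockwise fake quantization along each row (feature dimension)."""
    return [
        [q for s in range(0, len(row), block_size)
         for q in quantize_block(row[s:s + block_size])]
        for row in x
    ]


def column_means(x):
    """Per-feature mean over tokens: a reduction-only operation."""
    n = len(x)
    return [sum(col) / n for col in zip(*x)]


def mse(a, b):
    total = sum((u - v) ** 2 for ra, rb in zip(a, b) for u, v in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def demo(tokens=256, dim=64, seed=0):
    rng = random.Random(seed)
    # Synthetic activations: a shared per-feature mean (rank-one bias) plus small noise.
    mu = [rng.gauss(0.0, 4.0) for _ in range(dim)]
    x = [[m + rng.gauss(0.0, 0.5) for m in mu] for _ in range(tokens)]

    # Baseline: quantize the raw activations; the mean dominates each block's scale.
    err_raw = mse(x, quantize_rows(x))

    # Mean removal: quantize only the residual, then restore the mean in full precision.
    mean = column_means(x)
    resid = [[v - m for v, m in zip(row, mean)] for row in x]
    resid_q = quantize_rows(resid)
    x_hat = [[r + m for r, m in zip(row, mean)] for row in resid_q]
    err_centered = mse(x, x_hat)
    return err_raw, err_centered


if __name__ == "__main__":
    raw, centered = demo()
    print(f"MSE raw: {raw:.5f}  MSE mean-removed: {centered:.5f}")
```

In this toy setting the per-block scale is set by the mean component when the raw tensor is quantized, so the small token-to-token variation falls between coarse grid points; after centering, the scale tracks the residual and the grid resolves it more finely.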

Paper Structure (29 sections, 3 theorems, 64 equations, 5 figures, 1 table)

Introduction
Contributions.
Mean bias phenomenon
Structural Emergence of Mean Bias
Empirical Trajectory: Mean-Bias Energy Rises with Training and Depth.
Mechanism: A Three-Stage Causal Chain.
Mean Bias as the Dominant Source of Activation Outliers
Outlier Attribution.
High-Dimensional Extreme-Value Amplification.
Mean Bias-Aware Low-Bit Training Method
Notation and quantization operator.
Forward pass: activation mean--residual splitting.
Backward pass: output-gradient mean--residual splitting.
Computational profile.
Mean Bias-Aware Low-Bit Training Experiments
...and 14 more sections

Key Result

Theorem 1

Let an activation coordinate be modeled as $\mu_j + Z_{ij}$, where $\mu_j$ is a deterministic mean shift and $Z_{ij}$ is independent zero-mean sub-Gaussian noise with variance proxy $\sigma^2$. For a quantization threshold $t>0$ such that $t < |\mu_j|$, the probability that the coordinate magnitude exceeds the threshold satisfies
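The displayed inequality is not reproduced in this listing. As a rough guide to the kind of bound the setup supports, a standard one-sided sub-Gaussian tail argument gives the following; this is a sketch under the stated assumptions and may differ from the paper's exact statement and constants. The symbol $X_{ij}$ for the activation coordinate is assumed notation.

```latex
% Assumed notation: X_{ij} = \mu_j + Z_{ij}, with Z_{ij} zero-mean sub-Gaussian
% (variance proxy \sigma^2) and 0 < t < |\mu_j|.
\begin{align*}
\Pr\bigl(|X_{ij}| > t\bigr)
  &\ge \Pr\bigl(\operatorname{sign}(\mu_j)\, X_{ij} > t\bigr) \\
  &= \Pr\bigl(\operatorname{sign}(\mu_j)\, Z_{ij} > -(|\mu_j| - t)\bigr) \\
  &= 1 - \Pr\bigl(\operatorname{sign}(\mu_j)\, Z_{ij} \le -(|\mu_j| - t)\bigr) \\
  &\ge 1 - \exp\!\left(-\frac{(|\mu_j| - t)^2}{2\sigma^2}\right).
\end{align*}
```

The last step uses the sub-Gaussian lower-tail bound $\Pr(Z \le -s) \le \exp(-s^2 / (2\sigma^2))$ for $s > 0$, applied to $\operatorname{sign}(\mu_j) Z_{ij}$, which has the same variance proxy. Read this way, a coordinate whose mean shift exceeds the threshold by several $\sigma$ crosses that threshold for almost every token.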

Figures (5)

Figure 1: Evidence for mean-bias coherence and spectral dominance. Left: projection-sign consistency across tokens. Right: dominant alignment of mean vector with the top right singular vector.
Figure 2: Mean-bias energy decomposition across layers and training stages, shown as compact summaries for early (10k) and late (170k) checkpoints.
Figure 3: Operator-level input/output energy comparison. Left: attention softmax at 10k. Right: FFN SwiGLU. Both operators increase mean-bias energy from input to output.
Figure 4: Activation outlier contribution shares (top 0.1% entries) across layers and training steps. Each row compares early (10k, left) and late (170k, right) for the same module.
Figure 5: Training loss curves for Qwen3-0.6B under BF16 and FP4 + Averis.

Theorems & Definitions (5)

Theorem 1: Elementwise Extreme Dominance
Theorem 2: Dense Extreme Amplification by Mean Bias
proof
Theorem 3: High-Dimensional Extreme-Value Separation (Sharper Lower Bound)
proof
