Metis: Training LLMs with FP4 Quantization

Hengjie Cao, Mengyi Chen, Yifeng Yang, Ruijun Huang, Fang Dong, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Yuan Cheng, Fan Wu, Fan Yang, Tun Lu, Ning Gu, Li Shang

TL;DR

Anisotropic singular-value spectra in weights, activations, and gradients are a barrier to FP4 training of large language models; Metis addresses it by quantizing in the spectral domain. It partitions each spectrum into narrow sub-distributions that are quantized independently, and it identifies the dominant subspace cheaply using sparse random sampling and random projection, enabling end-to-end W4A4G4 training with minimal fidelity loss. Empirically, Metis narrows the training-loss gap to BF16 to 0.4% on LLaMA-3 8B (100B tokens) and outperforms the authors’ implementation of Nvidia’s FP4 recipe in both fidelity and efficiency, indicating a practical path to ultra-low-bit training for state-of-the-art LLMs. Because the spectral decomposition adds negligible overhead, the approach could make large-scale pretraining and fine-tuning at very low precision more cost-effective.

Abstract

This work identifies anisotropy in the singular value spectra of parameters, activations, and gradients as the fundamental barrier to low-bit training of large language models (LLMs). These spectra are dominated by a small fraction of large singular values, inducing wide numerical ranges that cause quantization bias and severe spectral distortion, ultimately degrading training performance. This work presents Metis, a spectral-domain quantization framework that partitions anisotropic spectra into narrower sub-distributions for independent quantization, thereby reducing errors and preserving spectral structure. To minimize overhead, Metis leverages two key properties of the dominant spectral subspace: it is preserved under sparse random sampling and under random projection, which reduces the decomposition cost to a negligible level. On LLaMA-3 8B trained with 100B tokens, Metis enables robust W4A4G4 training with FP4 quantization of weights, activations, and gradients, yielding only a 0.4% training loss gap and a 0.1% degradation in downstream accuracy relative to BF16. Beyond matching BF16 fidelity, Metis also surpasses our implementation of Nvidia's recently announced (yet to be publicly released) FP4 recipe, consistently achieving lower loss and higher downstream accuracy while incurring significantly lower computational overhead. The code implementation for Metis is available at: https://anonymous.4open.science/r/Metis-quantization-644B.
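
The idea of splitting a matrix into a dominant low-rank branch and a residual, then quantizing each independently, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the synthetic matrix, the per-tensor scaling, the choice to keep singular values in high precision, and all function names (`quantize_fp4`, `split_by_random_projection`, `make_anisotropic`) are assumptions made for illustration. The random projection used here is the standard randomized range finder, standing in for whatever projection scheme the paper uses.

```python
import numpy as np

# Magnitudes representable by the FP4 E2M1 format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def quantize_fp4(x):
    """Simulated FP4 rounding with a single per-tensor scale (an assumption).

    Returns the dequantized tensor and the scale that was used.
    """
    amax = np.abs(x).max()
    if amax == 0.0:
        return x.copy(), 1.0
    scale = amax / FP4_GRID[-1]          # map the largest magnitude to 6.0
    mag = np.abs(x) / scale
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)  # nearest grid point
    return np.sign(x) * FP4_GRID[idx] * scale, scale


def split_by_random_projection(x, rank, rng):
    """Approximate x = U diag(s) Vt + residual via a randomized range finder."""
    omega = rng.standard_normal((x.shape[1], rank))
    q, _ = np.linalg.qr(x @ omega)       # orthonormal basis, shape (m, rank)
    ub, s, vt = np.linalg.svd(q.T @ x, full_matrices=False)
    u = q @ ub                           # approximate left singular vectors
    residual = x - (u * s) @ vt          # what the low-rank branch misses
    return u, s, vt, residual


def make_anisotropic(m, n, rng):
    """Synthetic matrix: four large singular components plus a small tail."""
    u0, _ = np.linalg.qr(rng.standard_normal((m, 4)))
    v0, _ = np.linalg.qr(rng.standard_normal((n, 4)))
    dominant = (u0 * np.array([100.0, 80.0, 60.0, 40.0])) @ v0.T
    tail = 0.01 * rng.standard_normal((m, n))
    return dominant + tail


rng = np.random.default_rng(0)
x = make_anisotropic(64, 64, rng)
u, s, vt, residual = split_by_random_projection(x, rank=8, rng=rng)

# Each branch gets its own scale, so the narrow residual is not forced
# onto the coarse grid dictated by the dominant components.
_, full_scale = quantize_fp4(x)
u_q, _ = quantize_fp4(u)
vt_q, _ = quantize_fp4(vt)
res_q, res_scale = quantize_fp4(residual)
x_hat = (u_q * s) @ vt_q + res_q         # singular values kept in high precision

print(f"per-tensor scale, direct quantization: {full_scale:.4f}")
print(f"per-tensor scale, residual branch:     {res_scale:.4f}")
```

Comparing the two printed scales shows that the residual branch is quantized on a much finer grid than the full matrix would be. Whether this translates into lower end-to-end error depends on how the factors themselves are scaled and quantized, which this toy example does not attempt to measure.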

Paper Structure

This paper contains 30 sections, 6 equations, 27 figures, 3 tables.

Figures (27)

  • Figure 1: Overview of anisotropy and its impact on quantization, illustrated using a gradient matrix from LLaMA-3 8B. (A) Singular value spectrum exhibits strong anisotropy, with a few singular values dominating the spectrum. (B) The wide matrix distribution arises from the superposition of singular components: large components (e.g., $i=0$) drive the high-value region, while small components concentrate near zero. (C) Quantization bias disproportionately rounds many small values to zero. (D–E) In spectral space, smaller singular components incur substantially larger relative quantization errors in singular values and more severe perturbations in singular directions. Details for (A) are in Section \ref{analysis:anisotropy}, for (B) in Section \ref{analysis:wide-dist}, and for (C–E) in Section \ref{analysis:bias}.
  • Figure 2: Analysis of weight, activation, and gradient matrices (layer 32, feed-forward network (FFN)). (A) The singular value spectra exhibit strong anisotropy, with only 0.63%, 3.15%, and 2.91% of components, respectively (identified by the elbow point of maximum curvature), dominating the spectrum. (B) Filled regions denote full-matrix distributions; dashed histograms show selected rank-1 components ($\mathbf{u}_i \sigma_i \mathbf{v}_i^\top$ for $i=0,16,128,1024$). Dominant components (e.g., $i=0$) drive the high-value region, while smaller ones (e.g., $i=1024$) contribute near zero. See \ref{appendix:singular-spectrum} for additional results.
  • Figure 3: Analysis of weight, activation, and gradient matrices with hidden dimension 4096 (layer 32, FFN). (A) Left singular vector distributions: all exhibit similar shapes with widths much smaller than that of the full matrix. (B) Yellow regions show the residuals after removing the top 128 components ($3\%\times 4096 \approx 123$, rounded to the nearest power of two), while the grey region represents the original matrix distribution. The residuals are one to two orders of magnitude narrower than the full matrix, confirming that wide ranges originate from dominant components. More results in \ref{appendix:singular-vector}.
  • Figure 4: Subspace alignment between the dominant subspace of the full batch and that of randomly sampled subsets of sequences. Alignment quickly saturates as the sample ratio increases, with just 1% of sequences achieving nearly 0.9 alignment with the full-batch subspace. More results in \ref{appendix:One-Sequence}.
  • Figure 5: Training loss curves for (A) GPT-2 130M, (B) GPT-2 1.1B, and (C) LLaMA-3 8B. Direct NVFP4 incurs a loss gap of 3–4% relative to the BF16 baseline, while Metis reduces the gap to 0.4% on LLaMA-3 and even slightly surpasses the BF16 baseline on the GPT-2 models. The latter result may be attributed to the separation of low-rank and residual branches in weight matrices, which reduces interference between feature subspaces.
  • ...and 22 more figures
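
The kind of measurement described in the Figure 4 caption, comparing the dominant subspace of a full batch with that of a random subset of sequences, can be sketched as follows. The caption does not define the alignment metric, so this sketch assumes one: the mean squared cosine of the principal angles between the two subspaces. The data is synthetic (each row stands in for one sequence), and the function names are hypothetical; the numbers it prints are not expected to match the paper's.

```python
import numpy as np


def dominant_right_subspace(x, k):
    """Top-k right singular vectors of x, as rows of a (k, d) matrix."""
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[:k]


def subspace_alignment(va, vb):
    """Mean squared cosine of the principal angles between two k-dim subspaces.

    1.0 means the subspaces coincide, 0.0 means they are orthogonal.
    This particular metric is an assumption of the sketch.
    """
    k = va.shape[0]
    return float(np.linalg.norm(va @ vb.T, "fro") ** 2 / k)


rng = np.random.default_rng(0)
num_seq, hidden, k = 2000, 64, 4

# Synthetic batch: a shared k-dimensional structure plus small isotropic noise.
basis, _ = np.linalg.qr(rng.standard_normal((hidden, k)))
coeffs = 10.0 * rng.standard_normal((num_seq, k))
batch = coeffs @ basis.T + 0.1 * rng.standard_normal((num_seq, hidden))

v_full = dominant_right_subspace(batch, k)
alignments = {}
for ratio in (0.01, 0.05, 0.25):
    n_sub = max(k, int(ratio * num_seq))
    rows = rng.choice(num_seq, size=n_sub, replace=False)  # sparse random sample
    v_sub = dominant_right_subspace(batch[rows], k)
    alignments[ratio] = subspace_alignment(v_full, v_sub)
    print(f"sample ratio {ratio:.2f}: alignment {alignments[ratio]:.3f}")
```

On strongly low-rank synthetic data like this, even a small sample recovers the dominant subspace almost exactly; real activations and gradients are noisier, which is why the paper's curve saturates more gradually.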