Metis: Training LLMs with FP4 Quantization
Hengjie Cao, Mengyi Chen, Yifeng Yang, Ruijun Huang, Fang Dong, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Yuan Cheng, Fan Wu, Fan Yang, Tun Lu, Ning Gu, Li Shang
TL;DR
Metis targets what the authors identify as the main barrier to FP4 training of large language models: anisotropic singular-value spectra in weights, activations, and gradients. It quantizes in the spectral domain, partitioning each spectrum into narrower sub-distributions that are quantized independently, and it keeps the decomposition cheap by preserving the dominant subspace through sparse random sampling and random projection. This enables end-to-end W4A4G4 training: on LLaMA-3 8B trained with 100B tokens, the training-loss gap to BF16 is 0.4% and downstream accuracy drops by 0.1%. Metis also achieves lower loss, higher downstream accuracy, and lower computational overhead than the authors' implementation of Nvidia’s FP4 recipe, which suggests a practical path to more cost-effective pretraining and fine-tuning at very low precision.
Abstract
This work identifies anisotropy in the singular value spectra of parameters, activations, and gradients as the fundamental barrier to low-bit training of large language models (LLMs). These spectra are dominated by a small fraction of large singular values, which induce wide numerical ranges that cause quantization bias and severe spectral distortion, ultimately degrading training performance. To address this, we present Metis, a spectral-domain quantization framework that partitions anisotropic spectra into narrower sub-distributions for independent quantization, thereby reducing errors and preserving spectral structure. To minimize overhead, Metis leverages two key properties of the dominant spectral subspace: it is preserved under sparse random sampling and under random projection, which reduces the decomposition cost to a negligible level. On LLaMA-3 8B trained with 100B tokens, Metis enables robust W4A4G4 training with FP4 quantization of weights, activations, and gradients, yielding only a 0.4% training loss gap and a 0.1% degradation in downstream accuracy relative to BF16. Beyond closely matching BF16 fidelity, Metis also surpasses our implementation of Nvidia's recently announced (yet to be publicly released) FP4 recipe, consistently achieving lower loss and higher downstream accuracy while incurring significantly lower computational overhead. The code implementation for Metis is available at: https://anonymous.4open.science/r/Metis-quantization-644B.
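The two ideas in the abstract, that random projection preserves the dominant subspace and that splitting an anisotropic spectrum yields parts with narrower numerical ranges, can be illustrated with a small NumPy sketch. It is not the Metis implementation and does not reproduce any reported result. Everything in it is an assumption made for illustration: the helper names `fake_quant_fp4` and `random_projection_split`, the toy spectrum, the rank and oversampling values, and the choice of an E2M1 value grid with a single abs-max scale per tensor.

```python
import numpy as np

# Magnitudes representable in the E2M1 flavour of FP4 (sign handled separately).
# Assumption: the paper's exact FP4 format and scaling scheme may differ.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4(x):
    """Round x to the nearest grid value under one abs-max scale.

    Returns the dequantized array and the scale that was used.
    """
    scale = np.max(np.abs(x)) / FP4_GRID[-1]
    if scale == 0.0:
        return np.zeros_like(x), 0.0
    mag = np.abs(x) / scale
    idx = np.argmin(np.abs(mag[..., None] - FP4_GRID), axis=-1)
    return np.sign(x) * FP4_GRID[idx] * scale, scale


def random_projection_split(w, rank, oversample, rng):
    """Split w into a low-rank part q @ b and a residual via random projection."""
    omega = rng.standard_normal((w.shape[1], rank + oversample))
    q, _ = np.linalg.qr(w @ omega)  # orthonormal basis for the sampled range
    b = q.T @ w                     # coordinates of w in that basis
    return q, b, w - q @ b


rng = np.random.default_rng(0)
n = 64

# Toy anisotropic spectrum: four dominant singular values, then a small tail.
sigma = np.concatenate([[100.0, 80.0, 60.0, 50.0], np.linspace(1.0, 0.1, n - 4)])
u, _ = np.linalg.qr(rng.standard_normal((n, n)))
v, _ = np.linalg.qr(rng.standard_normal((n, n)))
w = (u * sigma) @ v.T  # equals u @ diag(sigma) @ v.T

q, b, resid = random_projection_split(w, rank=4, oversample=4, rng=rng)

# Fraction of the squared Frobenius norm captured by the random-projection basis.
captured = 1.0 - np.linalg.norm(resid) ** 2 / np.linalg.norm(w) ** 2

# Per-tensor quantization scales: whole matrix versus residual alone.
_, scale_full = fake_quant_fp4(w)
resid_hat, scale_resid = fake_quant_fp4(resid)

print(f"energy captured by random projection: {captured:.4f}")
print(f"abs-max scale, full matrix: {scale_full:.4f}")
print(f"abs-max scale, residual:    {scale_resid:.4f}")
```

In this toy setting the residual spans a much narrower range than the full matrix, so giving it its own scale yields a finer quantization step for the small singular components. The sketch deliberately stops there: it does not quantize the low-rank factors or compare end-to-end reconstruction error, since that depends on details of the method that the abstract does not specify.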
